Re: BTRFS partitioning scheme (was BTRFS with RAID1 cannot boot when removing drive)

2014-02-13 Thread Austin S Hemmelgarn
On 2014-02-13 12:33, Chris Murphy wrote:
 
 On Feb 13, 2014, at 1:50 AM, Frank Kingswood 
 fr...@kingswood-consulting.co.uk wrote:
 
 On 12/02/14 17:13, Saint Germain wrote:
 Ok, based on your advice, here is what I have done so far to use UEFI (remember that the objective is to have a clean and simple BTRFS RAID1 install).

 A) I start first with only one drive, I have gone with the following
 partition scheme (Debian wheezy, kernel 3.12, grub 2.00, GPT partition
 with parted):
 sda1 = 1MiB BIOS Boot partition (no FS, "set 1 bios_grub on" with parted to set the type)
 sda2 = 550 MiB EFI System Partition (FAT32, "toggle 2 boot" with parted to set the type), mounted on /boot/efi

 I'm curious, why so big? There's only one file of about 100kb there, and I 
 was considering shrinking mine to the minimum possible (which seems to be 
 about 33 MB).
 
 I'm not sure what OS loader you're using but I haven't seen a grubx64.efi 
 less than ~500KB. In general I'm seeing it at about 1MB. The Fedora grub-efi 
 and shim packages as installed on the ESP take up 10MB. So 33MiB is a bit small, and if we were more conservative, we'd update the OS loader by writing the new one to a temp directory rather than overwriting the existing one, and then remove the old one and rename the new one.
 
 The UEFI spec says if the system partition is FAT, it should be FAT32. For 
 removable media it's FAT12/FAT16. I don't know what tool the various distro 
 installers are using, but at least on Fedora they are using mkdosfs, part of 
 dosfstools. And its cutoff for making FAT16/FAT32 based on media size is 
 500MB unless otherwise specified, and the installer doesn't specify so 
 actually by default Fedora system partitions are FAT16, to no obvious ill 
 effect. But if you want a FAT32 ESP created by the installer, the ESP needs to be at least 500 MiB (about 525 MB), so 550 MB is a reasonable number to make that happen.
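 
 For illustration only, formatting such an ESP by hand amounts to something like the following (the device name and partition number are placeholders; -F 32 simply forces mkdosfs to use FAT32 regardless of size):
 
   parted /dev/sda set 2 boot on
   mkdosfs -F 32 -n EFI /dev/sda2
   mount /dev/sda2 /boot/efi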
 
 If we were slightly smarter (and more A.R.), UEFI bugs aside, we'd put the 
 ESP as the last partition on the disk rather than as the first and then 
 honestly would we really care about consuming even 1GiB of the slowest part 
 of a spinning disk? Or causing a bit of overprovisioning for SSD? No. It's 
 probably a squeak of an improvement if anything.
 
 For those who want to use gummiboot, it calls for the kernel and initramfs to be located on the ESP, which is then mounted at /boot rather than /boot/efi. So that's another reason to make it bigger than usual.
 
 
 
 
 sda3 = 1 TiB root partition (BTRFS), mounted on /
 sda4 = 6 GiB swap partition
 (that way I should be able to be compatible with both CSM or UEFI)

 B) normal Debian installation on sdas, activate the CSM on the
 motherboard and reboot.

 C) apt-get install grub-efi-amd64 and grub-install /dev/sda

 And the problems begin:
 1) grub-install doesn't give any error but using the --debug I can see
 that it is not using EFI.
 2) Ok I force with grub-install --target=x86_64-efi
 --efi-directory=/boot/efi --bootloader-id=grub --recheck --debug
 /dev/sda
 3) This time something is generated in /boot/efi: 
 /boot/efi/EFI/grub/grubx64.efi
 4) Copy the file /boot/efi/EFI/grub/grubx64.efi to
 /boot/efi/EFI/boot/bootx64.efi

 is EFI/boot/ correct here?
 
 If you want a fallback bootloader, yes.
 

 If you're lucky then your BIOS will tell you what path it will try to read for the boot code. For me that is /EFI/debian/grubx64.efi.
 
 NVRAM is what does this. But if NVRAM becomes corrupt, or the entry is deleted for whatever reason, the proper fallback is boot<arch>.efi.

While this is what the UEFI spec says is supposed to be the fallback,
many systems don't actually look there unless the media is removable.
All of my UEFI systems instead look for EFI/Microsoft/Boot/bootmgfw.efi as
the fallback (because most x86 system designers don't care at all about
standards compliance as long as it will run Windows).
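
For anyone wanting to set that fallback up by hand, a rough sketch (the paths assume the ESP is mounted at /boot/efi, and the efibootmgr line is only an example of re-creating the normal NVRAM entry):

  mkdir -p /boot/efi/EFI/boot
  cp /boot/efi/EFI/grub/grubx64.efi /boot/efi/EFI/boot/bootx64.efi
  efibootmgr -c -d /dev/sda -p 2 -L grub -l '\EFI\grub\grubx64.efi'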



Re: Issue with btrfs balance

2014-02-13 Thread Austin S. Hemmelgarn


On 02/10/2014 08:41 AM, Brendan Hide wrote:
 On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote:
 <snip>
 Apparently, trying to use -mconvert=dup or -sconvert=dup on a
 multi-device filesystem using one of the RAID profiles for metadata
 fails with a statement to look at the kernel log, which doesn't show
 anything at all about the failure.
 ^ If this is the case then it is definitely a bug. Can you provide some
 version info? Specifically kernel, btrfs-tools, and Distro.
 <snip> it appears that the kernel stops you from converting to a dup profile for metadata in this case because it thinks that such a profile doesn't work on multiple devices, despite the fact that you can take a single-device filesystem, add a device, and it will still work fine even without converting the metadata/system profiles.
 I believe dup used to work on multiple devices but the facility was
 removed. In the standard case it doesn't make sense to use dup with
 multiple devices: It uses the same amount of diskspace but is more
 vulnerable than the RAID1 alternative.
 <snip> Ideally, this
 should be changed to allow converting to dup so that when converting a
 multi-device filesystem to single-device, you never have to have
 metadata or system chunks use a single profile.
 This is a good use-case for having the facility. I'm thinking that, if
 it is brought back in, the only caveat is that appropriate warnings
 should be put in place to indicate that it is inappropriate.
 
 My guess on how you'd like to migrate from raid1/raid1 to single/dup,
 assuming sda and sdb:
 btrfs balance start -dconvert=single -mconvert=dup /
 btrfs device delete /dev/sdb /
 
Do you happen to know which git repository and branch is preferred to
base patches on?  I'm getting ready to write one to fix this, and would
like to make it as easy as possible for the developers to merge.


Re: Issue with btrfs balance

2014-02-14 Thread Austin S Hemmelgarn
On 02/14/2014 02:56 AM, Brendan Hide wrote:
 On 14/02/14 05:42, Austin S. Hemmelgarn wrote:
 On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote:
 Do you happen to know which git repository and branch is
 preferred to base patches on?  I'm getting ready to write one to
 fix this, and would like to make it as easy as possible for the
 developers to merge.
 A list of the main repositories is maintained at 
 https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
 
 I'd suggest David Sterba's branch as he maintains it for
 userspace-tools integration.
 
In this case, it will need to be patched both in the userspace tools and
in the kernel; it's the kernel itself that prevents the balance, because
it thinks that you can't use dup profiles with multiple devices.


[PATCH] Allow forced conversion of metadata to dup profile on multiple devices

2014-02-14 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-- 
1.8.5.4




[PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-19 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control
*bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control
*bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));
 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-- 
1.8.5.4



Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-24 Thread Austin S Hemmelgarn
On 2014-02-24 08:37, Ilya Dryomov wrote:
 On Thu, Feb 20, 2014 at 6:57 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Feb 19, 2014 at 11:10:41AM -0500, Austin S Hemmelgarn wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 Reviewed-by: David Sterba dste...@suse.cz

 Sounds useful. The multiple devices + DUP setup is allowed when a device is added; this patch only adds the 'delete' counterpart. The improved data loss protection during the process is a good thing.
 
 Hi,
 
 Have you actually tried to queue it?  Unless I'm missing something, it won't
 compile, and on top of that, it seems to be corrupted too..
The patch itself was made using git, AFAICT it should be fine.  I've
personally built and tested it using UML.
 
 IIRC multiple devices + DUP is allowed only until the first balance, has that changed?

This is just a limitation of how the kernel handles balances: DUP
profiles with multiple devices do work, they're just terribly inefficient.
The primary use case is converting a multi-device FS with RAID metadata
to a single-device FS without having to reduce integrity.
 Thanks,
 
 Ilya
 


Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-24 Thread Austin S Hemmelgarn
On 2014-02-24 09:12, Ilya Dryomov wrote:
 On Mon, Feb 24, 2014 at 3:44 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 2014-02-24 08:37, Ilya Dryomov wrote:
 On Thu, Feb 20, 2014 at 6:57 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Feb 19, 2014 at 11:10:41AM -0500, Austin S Hemmelgarn wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 Reviewed-by: David Sterba dste...@suse.cz

 Sounds useful. The multiple devices + DUP setup is allowed when a device is added; this patch only adds the 'delete' counterpart. The improved data loss protection during the process is a good thing.

 Hi,

 Have you actually tried to queue it?  Unless I'm missing something, it won't
 compile, and on top of that, it seems to be corrupted too..
 The patch itself was made using git, AFAICT it should be fine.  I've
 personally built and tested it using UML.
 
 It doesn't look fine.  It was generated with git, but it got corrupted
 on the way: either how you pasted it or the email client you use is the
 problem.
 
 On Wed, Feb 19, 2014 at 6:10 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 ---
  fs/btrfs/volumes.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 07629e9..38a9522 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control
 *bctl,
 
 ^^^, that should be a single line
 
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
 -	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
 -	if (num_devices == 1)
 -		allowed |= BTRFS_BLOCK_GROUP_DUP;
 -	else if (num_devices > 1)
 +	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
 +	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 |
 BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
 @@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control
 *bctl,
 
 ^^^, ditto
 
 			goto out;
 		}
 	}
 +	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
 +	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
 +	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
 +	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
 +	    (num_devs > 1)) {
 +		if (bctl->flags & BTRFS_BALANCE_FORCE) {
 +			btrfs_info(fs_info, "force conversion of
 metadata "
 +				   "to dup profile on multiple
 devices");
 +		} else {
 +			btrfs_err(fs_info, "balance will reduce
 metadata

[PATCH] btrfs: Allow forced conversion of metadata to dup profile on, multiple devices

2014-02-26 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {


Re: Massive BTRFS performance degradation

2014-03-09 Thread Austin S Hemmelgarn
On 03/09/2014 04:17 AM, Swâmi Petaramesh wrote:
 On Sunday, 9 March 2014 at 08:48:20, KC wrote:
 I am experiencing massive performance degradation on my BTRFS
 root partition on SSD.
 
 BTW, is BTRFS still a SSD-killer ? It had this reputation a while
 ago, and I'm not sure if this still is the case, but I don't dare
 (yet) converting to BTRFS one of my laptops that has a SSD...
 
Actually, because of the COW nature of BTRFS, it should be better for
SSDs than something like ext4 (which DOES kill SSDs when journaling is
enabled, because it ends up doing thousands of read-modify-write cycles
to the same 128k of the disk under just generic usage).  Just make
sure that you use the 'ssd' and 'discard' mount options.
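
As an illustration (the device and mount point are placeholders, not a recommendation for every setup), an fstab entry using those options might look like:

  /dev/sda2  /  btrfs  defaults,ssd,discard  0  0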


Re: Incremental backup for a raid1

2014-03-14 Thread Austin S Hemmelgarn
On 2014-03-14 09:46, George Mitchell wrote:
 Actually, an interesting concept would be to have the initial two drive
 RAID 1 mirrored by 2 additional drives in 4-way configuration on a
 second machine at a remote location on a private high speed network with
 both machines up 24/7.  In that case, if such a configuration would
 work, either machine could be obliterated and the data would survive
 fully intact in full duplex mode.  It would just need to be remounted
 from the backup system and away it goes.  Just thinking of interesting
 possibilities with n-way mirroring.  Oh how I would love to have n-way
 mirroring to play with!
That can already be done, albeit slightly differently by stacking btrfs
RAID 1 on top of a pair of DRBD devices.  Of course, this doesn't
provide quite the same degree of safety as your suggestion, but it does
work (and DRBD makes the remote copy write-mostly for the local system
automatically).
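
A rough sketch of that stacking, assuming two DRBD resources have already been configured and show up as /dev/drbd0 and /dev/drbd1 (placeholder names; the DRBD configuration itself is omitted):

  mkfs.btrfs -d raid1 -m raid1 /dev/drbd0 /dev/drbd1
  btrfs device scan
  mount /dev/drbd0 /mnt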


Re: BTRFS setup advice for laptop performance ?

2014-04-04 Thread Austin S Hemmelgarn
On 2014-04-04 04:02, Swâmi Petaramesh wrote:
 Hi,
 
 I'm going to receive a new small laptop with a 500 GB 5400 RPM mechanical ole' rust HD, and I plan to install BTRFS on it.
 
 It will have a kernel 3.13 for now, until 3.14 gets released.
 
 However I'm still concerned with chronic dreadful BTRFS performance, and I still find that BTRFS degrades much over time even with periodic defrag and best practices etc.
I keep hearing this from people, but I personally don't find this to be
the case at all.  I'm pretty sure the 'big' performance degradation that
people are seeing is due to how they are using snapshots, not a result
of using BTRFS itself (I don't use them for anything other than ensuring
a stable system image for rsync and/or tar based backups).
 
 So I'd like to start with the best possible options and have a few questions :
 
 - Is it still recommended to mkfs with a nodesize or leafsize different 
 (bigger) than the default ? I wouldn't like to lose too much disk space 
 anyway 
 (1/2 nodesize per file on average ?), as it will be limited...
This depends on many things; the average size of the files on the disk
is the biggest factor.  In general, you should get the best disk
utilization by setting the nodesize so that a majority of the files are
smaller than the leafsize minus 256 bytes, and all but a few are smaller
than two times the leafsize minus 256 bytes.  However, if you want to
really benefit from the data compression, you should just use the
smallest leaf/nodesize for your system (which is what mkfs defaults to),
because BTRFS stores files that are (roughly) at least 256 bytes smaller
than the leafsize inline with the metadata, and doesn't compress such files.
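
For reference, the node/leaf size has to be chosen at mkfs time and cannot be changed afterwards. A minimal sketch with an illustrative 16 KiB value (the device name is a placeholder, and older btrfs-progs also take -l for the matching leaf size):

  mkfs.btrfs -n 16384 /dev/sdX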
 
 - Is it recommended to alter the FS to have skinny extents? I've done this on all of my BTRFS machines without problem, but the kernel still spits out a notice at mount time, and I'm worrying, kind of "Why is the kernel warning me I have skinny extents? Is it bad? Is it something I should avoid?"
I think that the primary reason for the warning is that it is backward
incompatible: older kernels can't mount filesystems that use it.
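
For reference, skinny extents are typically enabled on an existing (unmounted) filesystem with btrfstune; the device name below is a placeholder, and the exact flag should be checked against your btrfs-progs version:

  btrfstune -x /dev/sdX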
 
 - Are there other optimization tricks I should perform at mkfs time because 
 thay can't be changed later on ?
 
 - Are there other btrfstune or mount options I should pass before starting to 
 populate the FS with a system and data ?
Unless you are using stuff like QEMU or Virtualbox, you should probably
have autodefrag and space_cache on from the very start.
 
 - Generally speaking, does LZO compression improve or degrade performance ? 
 I'm not able to figure it out clearly.
As long as your memory bandwidth is significantly higher than your disk
bandwidth (which is almost always the case, even with SSDs), this should
provide at least some improvement for I/O involving large files.  Because
you are using a traditional hard disk instead of an SSD, you might get
better performance using zlib (assuming you don't mind slightly higher
processor usage for I/O to files larger than the leafsize).  If you care
less about disk utilization than you do about performance, you might want
to use compress-force instead of compress, as the performance boost comes
from not having to write as much data to disk.
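
Putting the options discussed in this reply together, an illustrative (not prescriptive) mount invocation for such a laptop might be:

  mount -o autodefrag,space_cache,compress-force=zlib /dev/sdX /mnt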
 
 TIA for the insight.
 


Re: BTRFS setup advice for laptop performance ?

2014-04-04 Thread Austin S Hemmelgarn
On 2014-04-04 08:48, Swâmi Petaramesh wrote:
 On Friday, 4 April 2014 at 08:33:10, Austin S Hemmelgarn wrote:
 However I'm still concerned with chronic dreadful BTRFS performance, and I still find that BTRFS degrades much over time even with periodic defrag and best practices etc.

 I keep hearing this from people, but i personally don't see this to be
 the case at all.  I'm pretty sure the 'big' performance degradation that
 people are seeing is due to how they are using snapshots, not a result
 using BTRFS itself (I don't use them for anything other than ensuring a
 stable system image for rsync and/or tar based backups).
 
 Maybe I was wrong to suppose that if a feature exists, it is supposed to be 
 usable... I have used ZFS for years, and on ZFS having *hundreds* of 
 snapshots 
 of any given FS have exactly zero impact on performance...
 
 With BTRFS, some time ago I tried to use SuSE snapper, which spends its time creating and releasing snapshots, but it soon made my systems unusable...
 
 Now, I only keep 2-3 manually made snapshots just for keeping a stable and 
 OK 
 archive of my machine in a known state just in case...
 
 But if even this has a noticeable negative impact on BTRFS performance, then 
 what the hell are BTRFS snapshots good at ??
 
 Kind regards.
 
I'm not saying that using a few snapshots is a bad thing, I'm saying
that thousands of snapshots is a bad thing (I have actually seen people
with that many, including one individual who had almost 32,000 snapshots
on the same drive).  I personally do keep a few around on my system on a
regular basis, even aside from the backups, and have no noticeable
performance degradation.  For reference, the (main) system that I am
using has an Intel Celeron 847 running at 1.1GHz, 4G of DDR3-1333 RAM,
and a 500G 5400 RPM SATA II hard disk.  My root filesystem is a BTRFS
volume mounted with autodefrag,space_cache,compress-force=lzo,noatime
(the noatime improves performance (and power efficiency) for btrfs
because metadata updates end up cascading up the metadata tree: updating
the atime on /etc/foo/bar causes the atime to be updated on /etc/foo,
which causes the atime to be updated on /etc, which causes the atime to
be updated on /).


Re: BTRFS setup advice for laptop performance ?

2014-04-07 Thread Austin S Hemmelgarn
On 2014-04-05 07:10, Swâmi Petaramesh wrote:
 On Saturday, 5 April 2014 at 10:12:17, Duncan wrote [excellent performance advice about disabling Akonadi in BTRFS etc.]:
 
 Thanks Duncan for all this excellent discussion.
 
 However I'm still rather puzzled by a filesystem for which the advice is "if you want tolerable performance, you have to turn off features that are the default with any other FS out there (relatime -> noatime), or you have to quit using this database, or you have to fiddle around with esoteric options such as disabling COW", which BTW is one of BTRFS's most prominent features.
 
The only reason AFAIK that noatime isn't the default on other
filesystems is because it breaks stuff like mutt.  Other than that,
nobody really uses atimes, and noatime will in fact get you better
performance on any filesystem.
 [...]
 To put it plainly, even if relatime causes writes, every other FS out there can cope with it. Even if Akonadi is heavy and a disk resource hog, any other FS out there can cope with it and still maintain acceptable, usable performance.
 
This is because every other filesystem (except ZFS) doesn't use COW
semantics.  IIRC, using those same features on ZFS causes the same
problems.  This in fact brings to mind one of the biggest reasons that I
refuse to use KDE (or systemd for that matter), KDE systems run slower
in my experience even on ext4, XFS, and JFS, not just on COW filesystems.


Re: BTRFS setup advice for laptop performance ?

2014-04-08 Thread Austin S Hemmelgarn
On 2014-04-08 07:56, Clemens Eisserer wrote:
 Hi,
 
 This is because every other filesystem (except ZFS) doesn't use COW
 semantics.
 
 Nilfs2 also is COW based.
 
 Regards, Clemens
 
Apologies, I had forgotten about NILFS2 (probably because I chose not
to deal with it due to stability issues that I have experienced, and a
lack of XATTR and ACL support).


Re: Which companies are using Btrfs in production?

2014-04-24 Thread Austin S. Hemmelgarn
On 2014-04-23 21:19, Marc MERLIN wrote:
 Oh while we're at it, are there companies that can say they are using btrfs
 in production?
 
 Marc
 
Ohio Gravure Technologies is currently preparing to use it on our next
generation of production systems.


Re: safe/necessary to balance system chunks?

2014-04-25 Thread Austin S Hemmelgarn
On 2014-04-25 13:24, Chris Murphy wrote:
 
 On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:
 

 Hi list,

 I've got a 3-device RAID1 btrfs filesystem that started out life as 
 single-device.

 btrfs fi df:

 Data, RAID1: total=1.31TiB, used=1.07TiB
 System, RAID1: total=32.00MiB, used=224.00KiB
 System, DUP: total=32.00MiB, used=32.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=66.00GiB, used=2.97GiB

 This still lists some system chunks as DUP, and not as RAID1.  Does this 
 mean that if one device were to fail, some system chunks would be 
 unrecoverable?  How bad would that be?
 
 Since it's system type, it might mean the whole volume is toast if the 
 drive containing those 32KB dies. I'm not sure what kind of information is in 
 system chunk type, but I'd expect it's important enough that if unavailable 
 that mounting the file system may be difficult or impossible. Perhaps btrfs 
 restore would still work?
 
 Anyway, it's probably a high penalty for losing only 32KB of data.  I think 
 this could use some testing to try and reproduce conversions where some 
 amount of system or metadata type chunks are stuck in DUP. This has come 
 up before on the list but I'm not sure how it's happening, as I've never 
 encountered it.

As far as I understand it, the system chunks are THE root chunk tree for
the entire system, that is to say, it's the tree of tree roots that is
pointed to by the superblock. (I would love to know if this
understanding is wrong).  Thus losing that data almost always means
losing the whole filesystem.

 Assuming this is something that needs to be fixed, would I be able to fix 
 this by balancing the system chunks?  Since the force flag is required, 
 does that mean that balancing system chunks is inherently risky or 
 unpleasant?
 
 I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 
 mountpoint; or with -sconvert=raid1,soft although it's probably a minor 
 distinction for such a small amount of data.
The kernel won't allow a balance involving system chunks unless you
specify force, as it considers any kind of balance using them to be
dangerous.  Given your circumstances, I'd personally say that the safety
provided by RAID1 outweighs the risk of making the FS un-mountable.
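
In other words, something along these lines (the mount point is a placeholder; as noted above, the kernel requires the force flag for a conversion involving system chunks):

  btrfs balance start -f -sconvert=raid1 /mnt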
 
 The metadata looks like it could use a balance, 66GB of metadata chunks 
 allocated but only 3GB used. So you could include something like -musage=50 
 at the same time and that will balance any chunks with 50% or less usage.
 
 
 Chris Murphy
 

Personally, I would recommend making a full backup of all the data (tar
works wonderfully for this) and recreating the entire filesystem from
scratch, passing all three devices to mkfs.btrfs.  This should result in
all the chunks being RAID1, and will also allow you to benefit from
newer features.
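
A rough sketch of that backup-and-recreate approach (all device names and paths are placeholders, and it assumes the backup target has enough free space):

  tar -cpf /backup/fs-backup.tar -C /mnt/array .
  umount /mnt/array
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc
  mount /dev/sda /mnt/array
  tar -xpf /backup/fs-backup.tar -C /mnt/array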


Re: safe/necessary to balance system chunks?

2014-04-25 Thread Austin S Hemmelgarn
On 2014-04-25 14:43, Steve Leung wrote:
 On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
 On 2014-04-25 13:24, Chris Murphy wrote:

 On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:

 I've got a 3-device RAID1 btrfs filesystem that started out life as
 single-device.

 btrfs fi df:

 Data, RAID1: total=1.31TiB, used=1.07TiB
 System, RAID1: total=32.00MiB, used=224.00KiB
 System, DUP: total=32.00MiB, used=32.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=66.00GiB, used=2.97GiB

 This still lists some system chunks as DUP, and not as RAID1.  Does
 this mean that if one device were to fail, some system chunks would
 be unrecoverable?  How bad would that be?

 Assuming this is something that needs to be fixed, would I be able
 to fix this by balancing the system chunks?  Since the force flag
 is required, does that mean that balancing system chunks is
 inherently risky or unpleasant?

 I don't think force is needed. You'd use btrfs balance start
 -sconvert=raid1 mountpoint; or with -sconvert=raid1,soft although
 it's probably a minor distinction for such a small amount of data.
 The kernel won't allow a balance involving system chunks unless you
 specify force, as it considers any kind of balance using them to be
 dangerous.  Given your circumstances, I'd personally say that the safety
 provided by RAID1 outweighs the risk of making the FS un-mountable.
 
 Agreed, I'll attempt the system balance shortly.
 
 Personally, I would recommend making a full backup of all the data (tar
 works wonderfully for this), and recreate the entire filesystem from
 scratch, but passing all three devices to mkfs.btrfs.  This should
 result in all the chunks being RAID1, and will also allow you to benefit
 from newer features.
 
 I do have backups of the really important stuff from this filesystem,
 but they're offsite.  As this is just for a home system, I don't have
 enough temporary space for a full backup handy (which is related to how
 I ended up in this situation in the first place).
 
 Once everything gets rebalanced though, I don't think I'd be missing out
 on any features, would I?
 
 Steve
In general, it shouldn't be an issue, but recreating it might get you
slightly better performance.  I actually have a similar situation with
how I have my desktop system set up; when I go about recreating the
filesystem (which I do every time I upgrade either the tools or the
kernel), I use the following approach:

1. Delete one of the devices from the filesystem
2. Create a new btrfs file system on the device just removed from the
filesystem
3. Copy the data from the old filesystem to the new one
4. one at a time, delete the remaining devices from the old filesystem
and add them to the new one, re-balancing the new filesystem after
adding each device.

This seems to work relatively well for me, and prevents the possibility
that there is ever just one copy of the data.  It does, however, require
that the amount of data that you are storing on the filesystem is less
than the size of one of the devices (although you can kind of work
around this limitation by setting compress-force=zlib on the new file
system when you mount it, then using defrag to decompress everything
after the conversion is done), and that you have to drop to single user
mode for the conversion (unless it's something that isn't needed all the
time, like the home directories or /usr/src, in which case you just log
everyone out and log in as root on the console to do it).
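
As a concrete sketch of those four steps for a simple two-device case (device names and mount points are placeholders):

  btrfs device delete /dev/sdb /mnt/old     # 1. free one device from the old FS
  mkfs.btrfs /dev/sdb                       # 2. create the new FS on it
  mount /dev/sdb /mnt/new
  cp -a /mnt/old/. /mnt/new/                # 3. copy the data across
  umount /mnt/old                           # 4. move the last device over and rebalance
  btrfs device add /dev/sda /mnt/new
  btrfs balance start /mnt/new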


Re: btrfs on bcache

2014-05-01 Thread Austin S Hemmelgarn
On 2014-04-30 14:16, Felix Homann wrote:
 Hi,
 a couple of months ago there has been some discussion about issues
 when using btrfs on bcache:
 
 http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018
 
 From looking at the mailing list archives I cannot tell whether or not
 this issue has been resolved in current kernels from either bcache's
 or btrfs' side.
 
 Can anyone tell me what's the current state of this issue? Should it
 be safe to use btrfs on bcache by now?

In all practicality, I don't think anyone who frequents the list knows.
 I do know that there are a number of people (myself included) who avoid
bcache in general because of having issues with seemingly random kernel
OOPSes when it is linked in (either as a module or compiled in), even
when it isn't being used.  My advice would be to just test it with some
non-essential data (maybe set up a virtual machine?).


Re: Help with space

2014-05-03 Thread Austin S Hemmelgarn
On 05/02/2014 03:21 PM, Chris Murphy wrote:
 
 On May 2, 2014, at 2:23 AM, Duncan 1i5t5.dun...@cox.net wrote:
 
 Something tells me btrfs replace (not device replace, simply
 replace) should be moved to btrfs device replace…
 
 The syntax for btrfs device is different though; replace is like
 balance: btrfs balance start and btrfs replace start. And you can
 also get a status on it. We don't (yet) have options to stop,
 start, resume, which could maybe come in handy for long rebuilds
 and a reboot is required (?) although maybe that just gets handled
 automatically: set it to pause, then unmount, then reboot, then
 mount and resume.
 
 Well, I'd say two copies if it's only two devices in the raid1...
 would be true raid1.  But if it's say four devices in the raid1,
 as is certainly possible with btrfs raid1, that if it's not
 mirrored 4-way across all devices, it's not true raid1, but
 rather some sort of hybrid raid,  raid10 (or raid01) if the
 devices are so arranged, raid1+linear if arranged that way, or
 some form that doesn't nicely fall into a well defined raid level
 categorization.
 
 Well, md raid1 is always n-way. So if you use -n 3 and specify
 three devices, you'll get 3-way mirroring (3 mirrors). But I don't
 know any hardware raid that works this way. They all seem to be
 raid 1 is strictly two devices. At 4 devices it's raid10, and only
 in pairs.
 
 Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
 something like raid1 (2 copies) + linear/concat. But that
 allocation is round robin. I don't read code but based on how a 3
 disk raid1 volume grows VDI files as it's filled it looks like 1GB
 chunks are copied like this
Actually, MD RAID10 can be configured to work almost the same with an
odd number of disks, except it uses (much) smaller chunks, and it does
more intelligent striping of reads.
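
For comparison, a three-disk md RAID10 with the default near-2 layout can be created along these lines (device names are placeholders):

  mdadm --create /dev/md0 --level=10 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1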
 
 Disk1   Disk2   Disk3
 134     124     235
 679     578     689
 
 So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a
 chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB
 of data taking up 18GB of space, 6GB on each drive. You can't do
 this with any other raid1 as far as I know. You do definitely run
 out of space on one disk first though because of uneven metadata to
 data chunk allocation.
 
 Anyway I think we're off the rails with raid1 nomenclature as soon
 as we have 3 devices. It's probably better to call it replication,
 with an assumed default of 2 replicates unless otherwise
 specified.
 
 There's definitely a benefit to a 3 device volume with 2
 replicates, efficiency wise. As soon as we go to four disks 2
 replicates it makes more sense to do raid10, although I haven't
 tested odd device raid10 setups so I'm not sure what happens.
 
 
 Chris Murphy
 
 



Re: RAID-1 - suboptimal write performance?

2014-05-16 Thread Austin S Hemmelgarn
On 05/16/2014 04:41 PM, Tomasz Chmielewski wrote:
 On Fri, 16 May 2014 14:06:24 -0400
 Calvin Walton calvin.wal...@kepstin.ca wrote:
 
 No comment on the performance issue, other than to say that I've seen
 similar on RAID-10 before, I think.

 Also, what happens when the system crashes, and one drive has
 several hundred megabytes data more than the other one?

 This shouldn't be an issue as long as you occasionally run a scrub or
 balance. The scrub should find it and fix the missing data, and a
 balance would just rewrite it as proper RAID-1 as a matter of course.
 
 It's similar (writes to just one drive, while the other is idle) when
 removing (many) snapshots. 
 
 Not sure if that's optimal behaviour.
 
I think, after having looked at some of the code, that I know what is
causing this (although my interpretation of the code may be completely
off target).  As far as I can make out, BTRFS only dispatches writes to
one device at a time, and the write() system call only returns when the
data is on both devices.  While dispatching to one device at a time is
optimal when both 'devices' are partitions on the same underlying disk
(and also if your optimization metric is the simplicity of the
underlying code), it degrades very fast to the worst case when using
multiple devices.  The underlying cause however, which the one device at
a time logic in BTRFS just makes much worse, is that the buffer for the
write() call is kept in memory until the write completes, and counts
against the per-process write-caching limit; when the process fills up
its write cache, the next call it makes that would write to the disk
hangs until the write cache is less full.

The two options that I've found that work around this are:
1. Run 'sync' whenever the program stalls, or
2. Disable write-caching by adding the following to /etc/sysctl.conf
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0

Option 1 is kind of tedious but doesn't hurt performance all that much;
Option 2 will lower throughput but will cause most of the stalls to
disappear.
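
For completeness, the settings from option 2 can also be applied at runtime before editing /etc/sysctl.conf (this simply applies the values suggested above):

  sysctl -w vm.dirty_bytes=0
  sysctl -w vm.dirty_background_bytes=0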

Ideally, BTRFS should dispatch the first write for a block in a
round-robin fashion among available devices.  This won't fix the
underlying issue, but it will make it less of an issue for BTRFS.





Re: send/receive and bedup

2014-05-19 Thread Austin S Hemmelgarn
On 2014-05-19 13:12, Konstantinos Skarlatos wrote:
 On 19/5/2014 7:01 μμ, Brendan Hide wrote:
 On 19/05/14 15:00, Scott Middleton wrote:
 On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote:
 On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
 I read so much about BTRFS that I mistook Bedup for Duperemove.
 Duperemove is actually what I am testing.
 I'm currently using programs that find files that are the same, and
 hardlink them together:
 http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html


 hardlink.py actually seems to be the faster (memory and CPU) one even though it's in Python.
 I can get others to run out of RAM on my 8GB server easily :(

 Interesting app.

 An issue with hardlinking (with the backups use-case, this problem
 isn't likely to happen), is that if you modify a file, all the
 hardlinks get changed along with it - including the ones that you
 don't want changed.

 @Marc: Since you've been using btrfs for a while now I'm sure you've
 already considered whether or not a reflink copy is the better/worse
 option.


 Bedup should be better, but last I tried I couldn't get it to work.
 It's been updated since then, I just haven't had the chance to try it
 again since then.

 Please post what you find out, or if you have a hardlink maker that's
 better than the ones I found :)


 Thanks for that.

 I may be  completely wrong in my approach.

 I am not looking for a file level comparison. Bedup worked fine for
 that. I have a lot of virtual images and shadow protect images where
 only a few megabytes may be the difference. So a file level hash and
 comparison doesn't really achieve my goals.

 I thought duperemove may be on a lower level.

 https://github.com/markfasheh/duperemove

 Duperemove is a simple tool for finding duplicated extents and
 submitting them for deduplication. When given a list of files it will
 hash their contents on a block by block basis and compare those hashes
 to each other, finding and categorizing extents that match each
 other. When given the -d option, duperemove will submit those
 extents for deduplication using the btrfs-extent-same ioctl.

 It defaults to 128k but you can make it smaller.

 I hit a hurdle though. The 3TB HDD  I used seemed OK when I did a long
 SMART test but seems to die every few hours. Admittedly it was part of
 a failed mdadm RAID array that I pulled out of a clients machine.

 The only other copy I have of the data is the original mdadm array
 that was recently replaced with a new server, so I am loathe to use
 that HDD yet. At least for another couple of weeks!


 I am still hopeful duperemove will work.
 Duperemove does look exactly like what you are looking for. The last
 traffic on the mailing list regarding that was in August last year. It
 looks like it was pulled into the main kernel repository on September
 1st.

 The last commit to the duperemove application was on April 20th this
 year. Maybe Mark (cc'd) can provide further insight on its current
 status.

 I have been testing duperemove and it seems to work just fine, in contrast with bedup, which I have been unable to install/compile/sort out due to the mess with Python versions. I have 2 questions about duperemove:
 1) can it use existing filesystem csums instead of calculating its own?
While this might seem like a great idea at first, it really isn't.
BTRFS uses CRC32c at the moment as its checksum algorithm, and while
that is relatively good at detecting small differences (i.e. a single
bit flipped out of every 64 or so bytes), it is known to have issues
with hash collisions.  Normally, the data on disk won't change enough
even from a media error to cause a hash collision, but when you start
using it to compare extents that aren't known to be the same to begin
with, and then try to merge those extents, you run the risk of serious
file corruption.  Also, AFAIK, BTRFS doesn't expose the block checksums
to userspace directly (although I may be wrong about this, in which case
I retract the following statement), so this would require some
kernel-space support.
 2) can it be included in btrfs-progs so that it becomes a standard
 feature of btrfs?
I would definitely like to second this suggestion, I hear a lot of
people talking about how BTRFS has batch deduplication, but it's almost
impossible to make use of without extra software or writing your own code.
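
For reference, a typical invocation of the duperemove tool described above might look like this (the path and block size are illustrative; -d is what actually submits the matching extents for deduplication):

  duperemove -d -r -b 64k /mnt/data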





Re: ditto blocks on ZFS

2014-05-20 Thread Austin S Hemmelgarn
On 2014-05-19 22:07, Russell Coker wrote:
 On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
 This is extremely difficult to measure objectively. Subjectively ... see
 below.

 [snip]

 *What other failure modes* should we guard against?

 I know I'd sleep a /little/ better at night knowing that a double disk
 failure on a raid5/1/10 configuration might ruin a ton of data along
 with an obscure set of metadata in some long tree paths - but not the
 entire filesystem.
 
 My experience is that most disk failures that don't involve extreme physical 
 damage (EG dropping a drive on concrete) don't involve totally losing the 
 disk.  Much of the discussion about RAID failures concerns entirely failed 
 disks, but I believe that is due to RAID implementations such as Linux 
 software RAID that will entirely remove a disk when it gives errors.
 
 I have a disk which had ~14,000 errors of which ~2000 errors were corrected 
 by 
 duplicate metadata.  If two disks with that problem were in a RAID-1 array 
 then duplicate metadata would be a significant benefit.
 
 The other use-case/failure mode - where you are somehow unlucky enough
 to have sets of bad sectors/bitrot on multiple disks that simultaneously
 affect the only copies of the tree roots - is an extremely unlikely
 scenario. As unlikely as it may be, the scenario is a very painful
 consequence in spite of VERY little corruption. That is where the
 peace-of-mind/bragging rights come in.
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 The NetApp research on latent errors on drives is worth reading.  On page 12 
 they report latent sector errors on 9.5% of SATA disks per year.  So if you 
 lose one disk entirely the risk of having errors on a second disk is higher 
 than you would want for RAID-5.  While losing the root of the tree is 
 unlikely, losing a directory in the middle that has lots of subdirectories is 
 a risk.
 
 I can understand why people wouldn't want ditto blocks to be mandatory.  But 
 why are people arguing against them as an option?
 
 
 As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
 like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
 unimportant data.
 
But the proposed changes for n-way replication would already handle
this.  They would just need the option of having more than one copy per
device (which theoretically shouldn't be too hard once you have n-way
replication).  Also, BTRFS already has the option of replicating the
root tree across multiple devices (it is included in the System Data
subset), and in fact does so by default when using multiple devices.
There are also plans for per-subvolume or per-file RAID level selection,
but IIRC that is planned for after n-way replication (and of course
RAID 5/6, as n-way replication isn't going to be implemented until
after RAID 5/6).





Re: ditto blocks on ZFS

2014-05-22 Thread Austin S Hemmelgarn
On 2014-05-21 19:05, Martin wrote:
 Very good comment from Ashford.
 
 
 Sorry, but I see no advantages from Russell's replies other than for a
 feel-good factor or a dangerous false sense of security. At best,
 there is a weak justification that for metadata, again going from 2% to
 4% isn't going to be a great problem (storage is cheap and fast).
 
 I thought an important idea behind btrfs was that we avoid by design in
 the first place the very long and vulnerable RAID rebuild scenarios
 suffered for block-level RAID...
 
 
 On 21/05/14 03:51, Russell Coker wrote:
 Absolutely. Hopefully this discussion will inspire the developers to
 consider this an interesting technical challenge and a feature that
 is needed to beat ZFS.
 
 Sorry, but I think that is completely the wrong reasoning. ...Unless
 that is you are some proprietary sales droid hyping features and big
 numbers! :-P
 
 
 Personally I'm not convinced we gain anything beyond what btrfs will
 eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
 
 Also note that usually, data is wanted to be 100% reliable and
 retrievable. Or if that fails, you go to your backups instead. Gambling
 proportions and importance rather than *ensuring* fault/error
 tolerance is a very human thing... ;-)
 
 
 Sorry:
 
 Interesting idea but not convinced there's any advantage for disk/SSD
 storage.
 
 
 Regards,
 Martin
 
 
 
 
 
Another nice option in this case might be adding logic to make sure
that there is some (considerable) offset between copies of metadata
using the dup profile (on every filesystem where I have actually
looked at the low-level on-disk structures, both copies of the
System chunks were right next to each other, right at the beginning of
the disk, which of course undermines the usefulness of storing two
copies of them on disk).  Adding an offset to those allocations would
provide better protection against some of the more common 'idiot'
failure modes (e.g. trying to use dd to write a disk image to a USB
flash drive, and accidentally overwriting the first n GB of your first
HDD instead).  Ideally, once we have n-way replication, System chunks
should default to one copy per device for multi-device filesystems.





Re: is it safe to change BTRFS_STRIPE_LEN?

2014-05-24 Thread Austin S Hemmelgarn
On 05/24/2014 12:44 PM, john terragon wrote:
 Hi.
 
 I'm playing around with (software) raid0 on SSDs and since I remember
 I read somewhere that intel recommends 128K stripe size for HDD arrays
 but only 16K stripe size for SSD arrays, I wanted to see how a
 small(er) stripe size would work on my system. Obviously with btrfs on
 top of md-raid I could use the stripe size I want. But if I'm not
 mistaken the stripe size with the native raid0 in btrfs is fixed to
 64K in BTRFS_STRIPE_LEN (volumes.h).
 So I was wondering if it would be reasonably safe to just change that
 to 16K (and duck and wait for the explosion ;) ).
 
 Can anyone adept to the inner workings of btrfs raid0 code confirm if
 that would be the right way to proceed? (obviously without absolutely
 any blame to be placed on anyone other than myself if things should go
 badly :) )
I personally can't render an opinion on whether changing it would make
things break or not, but I do know that it would need to be changed in
both the kernel and the tools, and the resultant kernel and tools would
not be entirely compatible with filesystems produced by the regular
tools and kernel, possibly to the point of corrupting any filesystem
they touch.

As for the 64k default stripe size, that sounds correct, and is probably
because that's the largest block that the I/O schedulers on Linux will
dispatch as a single write to the underlying device.





Re: btrfs send ioctl failed with -5: Input/output error

2014-05-26 Thread Austin S Hemmelgarn
On 05/26/2014 05:04 PM, Michael Welsh Duggan wrote:
 Michael Welsh Duggan m...@md5i.com writes:
 
 I am now getting the following error when trying to do a btrfs send:

 root@maru2:/usr/local/src/btrfs-progs# ./btrfs send
 /usr/local/snapshots/2014-05-15  /backup/intermediate
 At subvol /usr/local/snapshots/2014-05-15
 ERROR: send ioctl failed with -5: Input/output error

 I'm running a 3.14.4 kernel, and Btrfs progs v3.14.1.

 root@maru2:/usr/local/src/btrfs-progs# uname -a
 Linux maru2 3.14-1-amd64 #1 SMP Debian 3.14.4-1 (2014-05-13) x86_64 GNU/Linux

 root@maru2:/usr/local/src/btrfs-progs# ./btrfs --version
 Btrfs v3.14.1

 Is there anything I can do to help debug this issue?
 
 I'd like to find out what is happening here.  I am an experienced C
 programmer, but have not dealt with kernel hacking before.  I _do_ know
 how to build and install a kernel.  I'd like some hints on what logging,
 etc., I could add in order to determine where in the send ioctl
 processing the IO error is coming from.  From there, I hope to move to
 why.  Ideally I'd be able to run gdb on this, but nothing I have read
 online about kernel debugging with gdb sounds promising.
 
 I'd make an image, but the amount of data is enough to make this
 prohibitive.
 
I would look into ftrace (the Tracing submenu of the Kernel Hacking menu
in menuconfig and nconfig).  The other thing to look at, at least
initially, is KDB, but using that requires either a serial console or an
AT or PS/2 keyboard.  Using UML and GDB for debugging is possible, but
it can take a long time to set up and is often slow (admittedly, KDB
isn't much faster).  If you do go the UML+GDB route, make sure to build
in fault injection (the dm-flakey module is particularly nice for this
type of thing), and I would suggest using something like Buildroot
(http://www.buildroot.net) to generate the root filesystem for it.
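
If it helps, a minimal ftrace session for this (a rough sketch, assuming
debugfs is mounted, the function-graph tracer is built in, and that
btrfs_ioctl_send is the relevant entry point on your kernel) would look
something like:

  cd /sys/kernel/debug/tracing
  echo btrfs_ioctl_send > set_graph_function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  # reproduce the failing 'btrfs send' in another terminal, then:
  echo 0 > tracing_on
  cp trace /tmp/send-trace.txt

The resulting trace should at least narrow down which callee is
returning the error.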





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 03:54, Swâmi Petaramesh wrote:
 Hi,
 
 I created a BTRFS filesytem over LVM over LUKS encryption on an SSD [yes, I 
 know...], and I noticed that the FS got created with metadata in DUP mode, 
 contrary to what man mkfs.btrfs says for SSDs - it would be supposed to be 
 SINGLE...
 
 Well I don't know if my system didn't identify the SSD because of the 
 LVM+LUKS 
 stack (however it mounts well by itself with the ssd flag and accepts the 
 discard option [yes, I know...]), or if the manpage is obsolete or if this 
 feature just doesn't work...?
 
 The SSD being a Micron RealSSD C400
 
 For both SSD preservation and data integrity, would it be advisable to change 
 metadata to SINGLE using a rebalance, or if I'd better just leave things 
 the 
 way they are...?
 
 TIA for any insight.
 
What mkfs.btrfs looks at is
/sys/block/whatever-device/queue/rotational; if that is 1, it assumes
that the device isn't an SSD.  I believe that LVM passes through whatever
the next lower layer's value is, but dmcrypt (and by extension LUKS)
always forces it to 1 (possibly to prevent programs from using
heuristics for enabling discard).
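
A quick way to check and work around it (a sketch; the device names are
just illustrative):

  # what the mkfs heuristic consults for the device it is given
  cat /sys/block/dm-1/queue/rotational
  # or just request the metadata profile explicitly at creation time
  mkfs.btrfs -m single /dev/mapper/VG-LINUX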





Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 06:35, Russell Coker wrote:
 On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:
 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 fallocate() isn't necessarily supposed to write anything really, it's
 mostly about allocating disk space in advance. I would claim that
 journald's usage of it is very much within the entire reason why it
 exists...

 I don't believe that fallocate() makes any difference to fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...
 
 Not being too different from the access patterns of other databases means 
 having all the same problems as other databases...  Oracle is now selling ZFS 
 servers specifically designed for running the Oracle database, but that's 
 with 
 hybrid storage flash (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
 features equivalent for ZIL and L2ARC it's easy to run a separate filesystem 
 on SSD for things that need performance (few if any current BTRFS users would 
 have a database too big to entirely fit on a SSD).
 
 The problem we are dealing with is database-like access patterns on systems 
 that are not designed as database servers.
 
 Would it be possible to get an interface for defragmenting files that's not 
 specific to BTRFS?  If we had a standard way of doing this then systemd-
 journald could request a defragment of the file at appropriate times.
 
While this is a wonderful idea, what about all the extra I/O this will
cause (and all the extra wear on SSDs)?  I understand wanting
this to be faster, but you should also consider that defragmenting
the file on a regular basis is going to trash performance for other
applications.





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 07:18, Swâmi Petaramesh wrote:
 Hi Austin, and thanks for your reply.
 
 Le lundi 16 juin 2014, 07:09:55 Austin S Hemmelgarn a écrit :

 What mkfs.btrfs looks at is
 /sys/block/whatever-device/queue/rotational, if that is 1 it knows
 that the device isn't a SSD.  I believe that LVM passes through whatever
 the next lower layer's value is, but dmcrypt (and by extension LUKS)
 always force it to a 1 (possibly to prevent programs from using
 heuristics for enabling discard)
 
 In the current running condition, the system clearly sees this is *not* 
 rotational, even thru the LVM/dmcrypt stack :
 
 # mount | grep btrfs
 /dev/mapper/VG-LINUX on / type btrfs 
 (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)
 
 # ll /dev/mapper/VGV-LINUX
 lrwxrwxrwx. 1 root root 7 16 juin  09:21 /dev/mapper/VG-LINUX - ../dm-1
 
 # cat /sys/block/dm-1/queue/rotational 
 0
 
 ...However, at mkfs.btrfs time, it might well not have seen it, as I made it
 from a live USB key in which neither lvm.conf nor crypttab had been
 tailored to allow trim commands...

 However, now that the FS is created, I still wonder whether I should use a
 rebalance to change the metadata from DUP to SINGLE, or if I'd better stay
 with DUP...
 
 Kind regards.
 
 
I'd personally stay with the DUP profile, but then that's just me being
paranoid.  You will almost certainly get better performance using the
SINGLE profile instead of DUP, but that is mostly because it requires
fewer blocks to be encrypted by LUKS (which is almost certainly your
primary bottleneck unless you have some high-end crypto-accelerator card).
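
If you do decide to switch, it can be done in place with a rebalance; a
sketch, assuming a btrfs-progs version with balance filters:

  # convert metadata from DUP to single on the mounted filesystem
  btrfs balance start -mconvert=single /
  # and back again later, if you change your mind
  btrfs balance start -mconvert=dup /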





Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Austin S Hemmelgarn
On 06/16/2014 03:52 PM, Martin wrote:
 On 16/06/14 17:05, Josef Bacik wrote:

 On 06/16/2014 03:14 AM, Lennart Poettering wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 
 I don't believe that fallocate() makes any difference to
 fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...
 
 Even though this appears to be a problem case for btrfs/COW, is there a
 more favourable write/access sequence possible that is easily
 implemented that is favourable for both ext4-like fs /and/ COW fs?
 
 Database-like writing is known to be 'difficult' for filesystems: can a data
 log be a simpler case?
 
 
 Was waiting for you to show up before I said anything since most systemd
 related emails always devolve into how evil you are rather than what is
 actually happening.
 
 Ouch! Hope you two know each other!! :-P :-)
 
 
 [...]
 since we shouldn't be fragmenting this badly.

 Like I said, what you guys are doing is fine; if btrfs falls on its face
 then it's not your fault.  I'd just like an exact idea of when you guys
 are fsync'ing so I can replicate it in a smaller way.  Thanks,
 
 Good if COW can be so resilient. I have about 2GBytes of data logging
 files and I must defrag those as part of my backups to stop the system
 fragmenting to a stop (I use cp -a to defrag the files to a new area
 and restart the data software logger on that).
 
 
 Random thoughts:
 
 Would using a second small file just for the mmap-ed pointers help avoid
 repeated rewriting of random offsets in the log file causing excessive
 fragmentation?
 
 Align the data writes to 16kByte or 64kByte boundaries/chunks?
 
 Are mmap-ed files a similar problem to using a swap file and so should
 the same btrfs file swap code be used for both?
 
 
 Not looked over the code so all random guesses...
 
 Regards,
 Martin
 
 
 
 
 
Just a thought, partly inspired by the mention of the swap code, has
anyone tried making the file NOCOW and pre-allocating to the max journal
size?  A similar approach has seemed to help on my systems with generic
log files (I keep debug level logs from almost everything, so I end up
with very active log files with ridiculous numbers of fragments if I
don't pre-allocate and mark them NOCOW).  I don't know for certain how
BTRFS handles appends to NOCOW files, but I would be willing to bet that
it ends up with a new fragment for each filesystem block worth of space
allocated.
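
For reference, the manual version of what I do for my own log files looks
roughly like this (a sketch; the path is just illustrative, and chattr +C
only takes effect on an empty file):

  touch /var/log/someapp.log
  chattr +C /var/log/someapp.log           # mark NOCOW while the file is still empty
  fallocate -l 128M /var/log/someapp.log   # pre-allocate the expected maximum size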





Re: btrfs on whole disk (no partitions)

2014-06-19 Thread Austin S Hemmelgarn
On 2014-06-18 16:10, Chris Murphy wrote:
 
 On Jun 18, 2014, at 1:29 PM, Daniel Cegiełka daniel.cegie...@gmail.com 
 wrote:
 
 Hi,
 I created btrfs directly to disk using such a scheme (no partitions):

 dd if=/dev/zero of=/dev/sda bs=4096
 mkfs.btrfs -L dev_sda /dev/sda
 mount /dev/sda /mnt

 cd /mnt
 btrfs subvolume create __active
 btrfs subvolume create __active/rootvol
 btrfs subvolume create __active/usr
 btrfs subvolume create __active/home
 btrfs subvolume create __active/var
 btrfs subvolume create __snapshots

 cd /
 umount /mnt
 mount -o subvol=__active/rootvol /dev/sda /mnt
 mkdir /mnt/{usr,home,var}
 mount -o subvol=__active/usr /dev/sda /mnt/usr
 mount -o subvol=__active/home /dev/sda /mnt/home
 mount -o subvol=__active/var /dev/sda /mnt/var

 # /etc/fstab
 UID=ID/btrfs rw,relative,space_cache,subvol=__active/rootvol0 0
 UUID=ID/usrbtrfs rw,relative,space_cache,subvol=__active/usr0 0
 UUID=ID/homebtrfs rw,relative,space_cache,subvol=__active/home0 0
 UUID=ID/varbtrfs rw,relative,space_cache,subvol=__active/var0 0
 
 rw and space_cache are redundant because they are default; and relative is 
 not a valid mount option. All you need is subvol= 
 
 Everything works fine. Is such a solution is recommended? In my
 opinion, the creation of the partitions seems to be completely
 unnecessary if you can use btrfs.
 
 It's firmware specific. Some BIOS firmwares will want to see a valid MBR 
 partition map at LBA 0, not just boot code. Others only care to blindly 
 execute the boot code which would be put in the Btrfs bootloader pad (64KB). 
 I don't know if parted 3.1 recognizes partitionless disks with Btrfs though 
 so it might slightly increase the risk that it's treated as something other 
 than what it is.
 
 For UEFI firmware, it would definitely need to be partitioned since an EFI 
 System partition is required.
 
 Chris Murphy
 
On most hardware, I would definitely suggest at least adding a
minimal-sized partition table; the people who design the BIOS code on
most systems make too many assumptions to trust their code to work
correctly.  That said, I regularly use BTRFS on flat devices for the
root filesystems of Xen PV guest systems, systems that boot from SAN,
and secondary disks on other systems, with no issues whatsoever.
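
If you want the safer layout, wrapping the whole disk in a single
partition costs almost nothing; a sketch (/dev/sdX is illustrative):

  parted -s /dev/sdX mklabel gpt
  parted -s /dev/sdX mkpart primary 1MiB 100%
  mkfs.btrfs -L mylabel /dev/sdX1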





Questions about BTRFS_IOC_FILE_EXTENT_SAME

2014-06-19 Thread Austin S Hemmelgarn
I have a few questions about the BTRFS_IOC_FILE_EXTENT_SAME ioctl, and
was hoping that I could get answers here without having to go source
diving or trying to test things myself:

1. What kind of overhead is there when it is called on a group of
extents that aren't actually the same (aside from the obvious pair of
context-switches that are required for an ioctl)?  I would think that it
would bail at the first difference it finds, but I have learned that
when it comes to kernel code, just because something seems obvious
doesn't mean that's how it's done.

2. Does it matter if the ranges passed in are actual extents, or can
they be arbitrary ranges of equal bytes in the files?

3. What happens if one of the ranges is truncated by the end of a file?
 IOW, if I have files A and B, and file A is longer than file B, and
file B is identical to the start of file A, what happens if I pass in
both files starting at offset 0, but pass the length of file A instead
of passing in the length of file B?

4. Does it matter if one of the extents passed in is compressed and the
other is not?

Thanks in advance.






Re: -d single for data blocks on a multiple devices doesn't work as it should

2014-06-24 Thread Austin S Hemmelgarn
 I somehow have doubts that a complex filesystem is the right project for
 me to start learning C, so I'll have to pass :-) No huge corporation
 with that itch behind me either, and I guess it will be more than a few
 hours for a btrfs programmer so no way I could sponsor that on my own.

Whether or not it is the right project really depends on where you
intend to do most of your C programming.  If you plan to do most of it
in kernel code and occasional userspace wrappers for kernel interfaces
(like me), then it could be a great place because it's under such heavy
development (which means more developers are working on it, and bugs get
spotted faster, both of which are good things for a project you are using
to learn a language).  If, however, you intend to do mostly userspace
programming, then I would definitely agree with you; programming in
userspace and in kernel-space are so different that it's almost like a
different language using the same syntax and similar semantics.





Re: [Question] Btrfs on iSCSI device

2014-06-27 Thread Austin S Hemmelgarn
On 2014-06-27 12:34, Goffredo Baroncelli wrote:
 Hi,
 On 06/27/2014 05:44 PM, Zhe Zhang wrote:
 Hi,

 I setup 2 Linux servers to share the same device through iSCSI. Then I
 created a btrfs on the device. Then I saw the problem that the 2 Linux
 servers do not see a consistent file system image.

 Details:
 -- Server 1 running kernel 2.6.32, server 2 running 3.2.1
 -- Both running btrfs v0.20-rc1
 -- Server 2 has device /dev/vdc, exposed as iSCSI target
  -- Server 1 mounts the device as /dev/sda
 -- Server 1 'mount /dev/sda /mnt/btrfs'; server 2 'mount /dev/vdc 
 /mnt/btrfs',
  -- When server 1 'touch /mnt/btrfs/foo', server 2 doesn't see any
 file under /mnt/btrfs
 -- I created /mnt/btrfs/foo on server 2 as well; then I added some
 content from both server 1 and server 2 to /mnt/btrfs/foo
 -- After that each server sees the content it adds, but not the
 content from the other server
 -- Both server 'umount /mnt/btrfs', and mount it again
 -- Then both servers see /mnt/btrfs/foo with the content added from
 server 2 (I guess it's because server 2 created the foo file later
 than server 1).

 I did a similar test on ext4 and both servers see a consistent image
 of the file system. When server 1 creates a foo file server 2
 immediately sees it.

 Is this how btrfs is supposed to work?
 
 I don't think that it is possible to mount the _same device_ at the _same
 time_ on two different machines. And this doesn't depend on the filesystem.

 The fact that you see it working is, I suspect, just coincidence.

 When I tried this (same SCSI HD connected to two machines), I had to ensure
 that the two machines never accessed the HD at the same time.
 

 Thanks,

 Zhe

 
 
If you need shared storage like that, you need to use a real cluster
filesystem like GFS2 or OCFS2, BTRFS isn't designed for any kind of
concurrent access to shared storage from separate systems.
The reason it appears to work when using iSCSI and not with directly
connected parallel SCSI or SAS is that iSCSI doesn't provide low level
hardware access.





Re: [Question] Btrfs on iSCSI device

2014-06-27 Thread Austin S Hemmelgarn
On 06/27/2014 07:40 PM, Russell Coker wrote:
 On Fri, 27 Jun 2014 18:34:34 Goffredo Baroncelli wrote:
 I don't think that it is possible to mount the _same device_ at the _same
 time_ on two different machines. And this doesn't depend by the filesystem.
 
 If you use a clustered filesystem then you can safely mount it on multiple 
 machines.
 
 If you use a non-clustered filesystem it can still mount and even appear to 
 work for a while.  It's surprising how many writes you can make to a dual-
 mounted filesystem that's not designed for such things before you get a 
 totally broken filesystem.
 
 On Fri, 27 Jun 2014 13:15:16 Austin S Hemmelgarn wrote:
 The reason it appears to work when using iSCSI and not with directly
 connected parallel SCSI or SAS is that iSCSI doesn't provide low level
 hardware access.
 
 I've tried this with dual-attached FC and had no problems mounting.  In what 
 way is directly connected SCSI different from FC?
 
FC is actually its own networking stack (and in theory you can even run
other protocols like IP and ATM on top of it), whereas parallel
SCSI is just a multi-drop bus, and SAS is just a tree-structured bus
with point-to-point communications emulated on top of it.  In other
words, parallel SCSI has topology constraints like RS-485, SAS has
topology constraints like USB, and FC has topology constraints like
Ethernet.

Secondly, most filesystems on Linux will let you mount them multiple
times on separate hosts (ext4 has a feature to prevent this, but it is
expensive and therefore turned off by default; I think XFS might have
something similar, but I'm not sure).  BTRFS should in theory be more
resilient than most because of its COW nature (as long as it's only a
few commit cycles, you should still be able to recover most of the data
just fine).
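
The ext4 feature in question is multiple-mount protection (MMP); as a
sketch of how it gets enabled (device name illustrative, filesystem must
be unmounted):

  tune2fs -O mmp /dev/sdX1
  tune2fs -l /dev/sdX1 | grep -i mmp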





Re: mount time of multi-disk arrays

2014-07-07 Thread Austin S Hemmelgarn
On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
 On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
 Hello List,

 can anyone tell me how much time is acceptable and assumable for a
 multi-disk btrfs array with classical hard disk drives to mount?

 I'm having a bit of trouble with my current systemd setup, because it
 couldn't mount my btrfs raid anymore after adding the 5th drive. With
 the 4 drive setup it failed to mount once in a few times. Now it fails
 everytime because the default timeout of 1m 30s is reached and mount is
 aborted.
 My last 10 manual mounts took between 1m57s and 2m12s to finish.
 I have the exact same problem, and have to manually mount my large
 multi-disk btrfs filesystems, so I would be interested in a solution as
 well.
 

 My hardware setup contains a
 - Intel Core i7 4770
 - Kernel 3.15.2-1-ARCH
 - 32GB RAM
 - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
  - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

 Thanks in advance

 André-Sebastian Liebe
 --


 # btrfs fi sh
 Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
  Total devices 5 FS bytes used 14.21TiB
  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
  devid4 size 3.64TiB used 2.86TiB path /dev/sde
  devid5 size 3.64TiB used 2.88TiB path /dev/sdb

 Btrfs v3.14.2-dirty

 # btrfs fi df /data/pool0/
 Data, single: total=14.28TiB, used=14.19TiB
 System, RAID1: total=8.00MiB, used=1.54MiB
 Metadata, RAID1: total=26.00GiB, used=20.20GiB
 unknown, single: total=512.00MiB, used=0.00

This is interesting; I actually did some profiling of the mount timings
for a bunch of different configurations of 4 (identical other than
hardware age) 1TB Seagate disks.  One of the arrangements I tested was
Data using the single profile and Metadata/System using RAID1.  Based on
the results I got, and what you are reporting, the mount time doesn't
scale linearly with the amount of storage space.

You might want to try the RAID10 profile for Metadata; of the
configurations I tested, the fastest used Single for Data and RAID10 for
Metadata/System.
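
If you want to try that, it can be converted online with a rebalance; a
sketch (the -f is needed because the System profile is being changed):

  btrfs balance start -mconvert=raid10 -sconvert=raid10 -f /data/pool0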

Also, based on the System chunk usage, I'm guessing that you have a LOT
of subvolumes/snapshots, and I do know that having very large (100+)
numbers of either does slow down the mount command (I don't think that
we cache subvolume information between mount invocations, so it has to
re-parse the system chunks for each individual mount).





Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-10 Thread Austin S Hemmelgarn
On 2014-07-09 22:10, Russell Coker wrote:
 On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
 - for someone using SAS or enterprise SATA drives with Linux, I
 understand btrfs gives the extra benefit of checksums, are there any
 other specific benefits over using mdadm or dmraid?

 I think I can answer this one.

 Most important advantage I think is BTRFS is aware of which blocks of the
 RAID are in use and need to be synced:

 - Instant initialization of RAID regardless of size (unless at some
 capacity mkfs.btrfs needs more time)
 
 From mdadm(8):
 
--assume-clean
   Tell mdadm that the array pre-existed and is known to be  clean.
   It  can be useful when trying to recover from a major failure as
   you can be sure that no data will be affected unless  you  actu‐
   ally  write  to  the array.  It can also be used when creating a
   RAID1 or RAID10 if you want to avoid the initial resync, however
   this  practice  — while normally safe — is not recommended.  Use
   this only if you really know what you are doing.
 
   When the devices that will be part of a new  array  were  filled
   with zeros before creation the operator knows the array is actu‐
   ally clean. If that is the case,  such  as  after  running  bad‐
   blocks,  this  argument  can be used to tell mdadm the facts the
   operator knows.
 
 While it might be regarded as a hack, it is possible to do a fairly instant 
 initialisation of a Linux software RAID-1.

This has the notable disadvantage, however, that the first scrub you run
will essentially perform a full resync if you didn't make sure that the
disks had identical data to begin with.
 - Rebuild after disk failure or disk replace will only copy *used* blocks
 
 Have you done any benchmarks on this?  The down-side of copying used blocks 
 is 
 that you first need to discover which blocks are used.  Given that seek time 
 is 
 a major bottleneck at some portion of space used it will be faster to just 
 copy the entire disk.
 
 I haven't done any tests on BTRFS in this regard, but I've seen a disk 
 replacement on ZFS run significantly slower than a dd of the block device 
 would.
 
First of all, this isn't really a good comparison for two reasons:
1. EVERYTHING on ZFS (or any filesystem that tries to do that much work)
is slower than a dd of the raw block device.
2. Even if the throughput is lower, this is only really an issue if the
disk is more than half full, because you don't copy the unused blocks.

Also, while it isn't really a recovery situation, I recently upgraded
from a 2 1TB disk BTRFS RAID1 setup to a 4 1TB disk BTRFS RAID10 setup,
and the performance of the re-balance really wasn't all that bad.  I
have maybe 100GB of actual data, so the array started out roughly 10%
full, and the re-balance only took about 2 minutes.  Of course, it
probably helps that I make a point to keep my filesystems de-fragmented,
scrub and balance regularly, and don't use a lot of sub-volumes or
snapshots, so the filesystem in question is not too different from what
it would have looked like if I had just wiped the FS and restored from a
backup.
 Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID
 should be able to do this as well. But also for scrubbing: BTRFS only
 check and repairs used blocks.
 
 When you scrub Linux Software RAID (and in fact pretty much every RAID) it 
 will only correct errors that the disks flag.  If a disk returns bad data and 
 says that it's good then the RAID scrub will happily copy the bad data over 
 the good data (for a RAID-1) or generate new valid parity blocks for bad data 
 (for RAID-5/6).
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 Page 12 of the above document says that nearline disks (IE the ones people 
 like me can afford for home use) have a 0.466% incidence of returning bad 
 data 
 and claiming it's good in a year.  Currently I run about 20 such disks in a 
 variety of servers, workstations, and laptops.  Therefore the probability of 
 having no such errors on all those disks would be .99534^20=.91081.  The 
 probability of having no such errors over a period of 10 years would be 
 (.99534^20)^10=.39290 which means that over 10 years I should expect to have 
 such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are 
 necessary features.
 






Re: Btrfs transaction checksum corruption losing root of the tree bizarre UUID change.

2014-07-10 Thread Austin S Hemmelgarn
On 07/10/2014 07:32 PM, Tomasz Kusmierz wrote:
 Hi all !
 
 So it's been some time with btrfs, and so far I was very pleased, but
 since I've upgraded to ubuntu from 13.10 to 14.04 problems started to
 occur (YES I know this might be unrelated).
 
 So in the past I've had problems with btrfs which turned out to be a
 problem caused by static from printer generating some corruption in
 ram causing checksum failures on the file system - so I'm not going to
 assume that there is something wrong with btrfs from the start.
 
 Anyway:
 On my server I'm running 6 x 2TB disk in raid 10 for general storage
 and 2 x ~0.5 TB raid 1 for system. Might be unrelated, but after
 upgrading to 14.04 I've started using Own Cloud which uses Apache 
 MySql for backing store - all data stored on storage array, mysql was
 on system array.
 
 All started with csum errors showing up in mysql data files and in
 some transactions!!!  Generally the system immediately switched btrfs
 into all-read-only mode, forced by the kernel (don't have
 dmesg / syslog now). Removed offending files, problem seemed to go
 away and started from scratch. After 5 days the problem reappeared and now
 was located around same mysql files and in files managed by apache as
 cloud. At this point since these files are rather dear to me I've
 decided to pull all stops and try to rescue as much as I can.
 
 As an exercise in btrfs management I've run btrfsck --repair - did not
 help. Repeated with --init-csum-tree - turned out that this left me
 with blank system array. Nice ! could use some warning here.
 
I know that this will eventually be pointed out by somebody, so I'm
going to save them the trouble and mention that it does say both on the
wiki and in the manpages that btrfsck should be a last resort (i.e., after
you have made sure you have backups of anything on the FS).
 I've moved all the drives to my main rig, which has a nice
 16GB of ECC ram, so errors from ram, cpu, or controller should be
 theoretically eliminated. I've used the system array drives and a spare
 drive to extract all dear to me files to newly created array (1tb +
 500GB + 640GB). Runned a scrub on it and everything seemed OK. At this
 point I've deleted dear to me files from storage array and ran  a
 scrub. Scrub now showed even more csum errors in transactions and one
 large file that was not touched FOR VERY LONG TIME (size ~1GB).
 Deleted file. Ran scrub - no errors. Copied dear to me files back to
 storage array. Ran scrub - no issues. Deleted files from my backup
 array and decided to call a day. Next day I've decided to run a scrub
 once more just to be sure this time it discovered a myriad of errors
 in files and transactions. Since I've had no time to continue decided
 to postpone on next day - next day I've started my rig and noticed
 that both backup array and storage array does not mount anymore. I was
 attempting to rescue situation without any luck. Power cycled PC and
 on next startup both arrays failed to mount, when I tried to mount
 backup array mount told me that this specific uuid DOES NOT EXIST
 !?!?!
 
 my fstab uuid:
 fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
 new uuid:
 771a4ed0-5859-4e10-b916-07aec4b1a60b
 
 
 tried to mount by /dev/sdb1 and it did mount. Tried by new uuid and it
 did mount as well. Scrub passes with flying colours on backup array
 while storage array still fails to mount with:
 
 root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
 mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail  or so
 
 for any device in the array.
 
 Honestly this is a question to more senior guys - what should I do now ?
 
 Chris Mason - have you got any updates to your old friend stress.sh
 ? If not I can try using previous version that you provided to stress
 test my system - but I this is a second system that exposes this
 erratic behaviour.
 
 Anyone - what can I do to rescue my beloved files (no sarcasm with
 zfs / ext4 / tapes / DVDs)
 
 ps. needles to say: SMART - no sata CRC errors, no relocated sectors,
 no errors what so ever (as much as I can see).
First thing that I would do is some very heavy testing with tools like
iozone and fio.  I would use the verify mode from iozone to further
check data integrity.  My guess based on what you have said is that it
is probably issues with either the storage controller (I've had issues
with almost every brand of SATA controller other than Intel, AMD, Via,
and Nvidia, and it almost always manifested as data corruption under
heavy load), or something in the disk's firmware.  I would still suggest
double-checking your RAM with Memtest, and checking the cables on the
drives.  The one other thing that I can think of is potential voltage
sags from the PSU (either because the PSU is overloaded at times, or
because of really noisy/poorly-conditioned line power).  Of course, I
may be 

Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Austin S Hemmelgarn
On 07/20/2014 10:00 AM, Tomasz Torcz wrote:
 On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
 TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:

 One week for a raid10 rebuild 4x3TB drives is a very long time.
 Any thoughts?
 Can you share any statistics from your RAID10 rebuilds?


 At a week, that's nearly 5 MiB per second, which isn't great, but isn't 
 entirely out of the realm of reason either, given all the processing it's 
 doing.  A day would be 33.11+, reasonable thruput for a straight copy, 
 and a raid rebuild is rather more complex than a straight copy, so...
 
   Uhm, sorry, but 5MBps is _entirely_ unreasonable.  It is order-of-magnitude
 unreasonable.  And all the processing shouldn't even show as a blip
 on modern CPUs.
   This speed is undefendable.
 
I wholly agree that it's undefendable, but I can tell you why it is so
slow: it's not 'all the processing' (which is maybe a few hundred
instructions on x86 for each block), it's that BTRFS still serializes
writes to devices instead of queuing all of them in parallel (that is,
when there are four devices that need to be written to, it writes to
each one in sequence, waiting for the previous write to finish before
dispatching the next write).  Personally, I would love to see this
behavior improved, but I really don't have any time to work on it myself.





Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.

2014-07-26 Thread Austin S Hemmelgarn
On 07/24/2014 05:28 PM, Chris Mason wrote:
 
 
 On 06/26/2014 11:53 PM, Qu Wenruo wrote:
 Current btrfs will only use the first superblock, making the backup
 superblocks only useful for 'btrfs rescue super' command.

 The old problem is that if we use backup superblocks when the first
 superblock is not valid, we will be able to mount a none btrfs
 filesystem, which used to contains btrfs but other fs is made on it.

 The old problem can be solved relatively easily by checking the first
 superblock in a special way:
 1) If the magic number in the first superblock does not match:
This filesystem is not btrfs anymore, just exit.
If end-user consider it's really btrfs, then old 'btrfs rescue super'
method is still available.

 2) If the magic number in the first superblock matches but checksum does
not match:
This filesystem is btrfs but first superblock is corrupted, use
backup roots. Just continue searching remaining superblocks.
 
 I do agree that in these cases we can trust that the backup superblock
 comes from the same filesystem.
 
 But, for right now I'd prefer the admin get involved in using the backup
 supers.  I think silently using the backups is going to lead to surprises.
Maybe there could be a non-default mount option to use backup
superblocks iff the first one is corrupted, and then log a warning
whenever this actually happens?  Not handling stuff like this
automatically really hurts HA use cases.






Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.

2014-07-27 Thread Austin S Hemmelgarn
On 07/27/2014 08:29 PM, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH RFC] btrfs: Use backup superblocks if and only if
 the first superblock is valid but corrupted.
 From: Austin S Hemmelgarn ahferro...@gmail.com
 To: Chris Mason c...@fb.com, Qu Wenruo quwen...@cn.fujitsu.com,
 linux-btrfs@vger.kernel.org
 Date: 2014年07月27日 10:57
 On 07/24/2014 05:28 PM, Chris Mason wrote:

 On 06/26/2014 11:53 PM, Qu Wenruo wrote:
 Current btrfs will only use the first superblock, making the backup
 superblocks only useful for 'btrfs rescue super' command.

 The old problem is that if we use backup superblocks when the first
 superblock is not valid, we will be able to mount a none btrfs
 filesystem, which used to contains btrfs but other fs is made on it.

 The old problem can be solved related easily by checking the first
 superblock in a special way:
 1) If the magic number in the first superblock does not match:
 This filesystem is not btrfs anymore, just exit.
 If end-user consider it's really btrfs, then old 'btrfs rescue
 super'
 method is still available.

 2) If the magic number in the first superblock matches but checksum
 does
 not match:
 This filesystem is btrfs but first superblock is corrupted, use
 backup roots. Just continue searching remaining superblocks.
 I do agree that in these cases we can trust that the backup superblock
 comes from the same filesystem.

 But, for right now I'd prefer the admin get involved in using the backup
 supers.  I think silently using the backups is going to lead to
 surprises.
 Maybe there could be a mount non-default mount-option to use backup
 superblocks iff the first one is corrupted, and then log a warning
 whenever this actually happens?  Not handling stuff like this
 automatically really hurts HA use cases.


 This seems better, and the comments also suggest this idea.
 What about merging the behavior into the 'recovery' mount option or adding a
 new mount option?
Personally, I'd add a new mount option, but make recovery imply that option.
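
For illustration only, usage could end up looking something like the
sketch below; the standalone option name is purely hypothetical and does
not exist anywhere yet:

  mount -o recovery /dev/sdX /mnt        # existing option, which would imply the new behaviour
  mount -o backup_super /dev/sdX /mnt    # hypothetical dedicated option for just this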






Re: Multi Core Support for compression in compression.c

2014-07-27 Thread Austin S Hemmelgarn
On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea, but compression in btrfs seems to be only
 using one core to compress.
 Depending on the CPU used and the amount of cores in the CPU we can
 make this much faster
 with multiple cores. This seems bad by my reading at least I would
 recommend for writing compression
 we write a function to use a certain amount of cores based on the load
 of the system's CPU not using
 more then 75% of the system's CPU resources as my system when idle has
 never needed more
 then one core of my i5 2500k to run when with interrupts for opening
 eclipse are running. For reading
 compression on good core seems fine to me as testing other compression
 software for reads , it's
 way less CPU intensive.
 Cheers Nick
We would probably get a bigger benefit from taking an approach like the
one SquashFS has recently added, that is, allowing multi-threaded
decompression for reads, and decompressing directly into the pagecache.
Such an approach would likely make zlib compression much more scalable
on large systems.






Re: Multi Core Support for compression in compression.c

2014-07-28 Thread Austin S Hemmelgarn
On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems to be only
 using one core to compress.
 Depending on the CPU used and the amount of cores in the CPU we can
 make this much faster
 with multiple cores. This seems bad by my reading at least I would
 recommend for writing compression
 we write a function to use a certain amount of cores based on the load
 of the system's CPU not using
 more then 75% of the system's CPU resources as my system when idle has
 never needed more
 then one core of my i5 2500k to run when with interrupts for opening
 eclipse are running. For reading
 compression on good core seems fine to me as testing other compression
 software for reads , it's
 way less CPU intensive.
 Cheers Nick
 We would probably get a bigger benefit from taking an approach like
 SquashFS has recently added, that is, allowing multi-threaded
 decompression fro reads, and decompressing directly into the pagecache.
  Such an approach would likely make zlib compression much more scalable
 on large systems.


 
 Austin,
 That seems better than my idea as you seem to be more up to date on
 btrfs development.
 If you and the other developers of btrfs are interested in adding this
 as a feature please let
 me know as I would like to help improve btrfs as the file system as
 an idea is great just
 seems like it needs a lot of work :).
 Nick
I wouldn't say that I am a BTRFS developer (power user maybe?), but I
would definitely say that parallelizing compression on writes would be a
good idea too (especially for things like lz4, which IIRC is either in
3.16 or in the queue for 3.17).  Both options would be a lot of work,
but almost any performance optimization would.  I would almost say that
it would provide a bigger performance improvement to get BTRFS to
intelligently stripe reads and writes (at the moment, any given worker
thread only dispatches one write or read to a single device at a time,
and any given write() or read() syscall gets handled by only one worker).





Re: Multi Core Support for compression in compression.c

2014-07-28 Thread Austin S Hemmelgarn
On 2014-07-28 11:57, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 11:13 AM, Nick Krause xerofo...@gmail.com
 wrote:
 On Mon, Jul 28, 2014 at 6:10 AM, Austin S Hemmelgarn 
 ahferro...@gmail.com wrote:
 On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn 
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems
 to be only using one core to compress. Depending on the
 CPU used and the amount of cores in the CPU we can make
 this much faster with multiple cores. This seems bad by
 my reading at least I would recommend for writing
 compression we write a function to use a certain amount
 of cores based on the load of the system's CPU not using 
 more then 75% of the system's CPU resources as my system
 when idle has never needed more then one core of my i5
 2500k to run when with interrupts for opening eclipse are
 running. For reading compression on good core seems fine
 to me as testing other compression software for reads ,
 it's way less CPU intensive. Cheers Nick
 We would probably get a bigger benefit from taking an
 approach like SquashFS has recently added, that is,
 allowing multi-threaded decompression fro reads, and
 decompressing directly into the pagecache. Such an approach
 would likely make zlib compression much more scalable on
 large systems.
 
 
 
 Austin, That seems better then my idea as you seem to be more
 up to date on brtfs devolopment. If you and the other
 developers of brtfs are interested in adding this as a
 feature please let me known as I would like to help improve
 brtfs as the file system as an idea is great just seems like
 it needs a lot of work :). Nick
 I wouldn't say that I am a BTRFS developer (power user maybe?),
 but I would definitely say that parallelizing compression on
 writes would be a good idea too (especially for things like
 lz4, which IIRC is either in 3.16 or in the queue for 3.17).
 Both options would be a lot of work, but almost any performance
 optimization would.  I would almost say that it would provide a
 bigger performance improvement to get BTRFS to intelligently
 stripe reads and writes (at the moment, any given worker thread
 only dispatches one write or read to a single device at a
 time, and any given write() or read() syscall gets handled by
 only one worker).
 
 
 I will look into this idea and see if I can do this for writes. 
 Regards Nick
 
 Austin, it seems that we don't want to release the cache for inodes
 in order to improve writes if we are going to use the page cache. We
 seem to be doing this for writes in end_compressed_bio_write for
 standard pages and in end_compressed_bio_write. If we want to cache
 write pages, why are we removing them then?  Seems like this needs to be
 removed in order to start off. Regards Nick
 
I'm not entirely sure; it's been a while since I went exploring in the
page-cache code.  My guess is that there is some reason, which you and I
aren't seeing, that the code is trying for write-around semantics; maybe
one of the people who originally wrote this code could weigh in?  Part of
this might be to do with the fact that normal page-cache semantics
don't always work as expected with COW filesystems (because a write goes
to a different block on the device than a read before the write would
have gone to).  It might be easier to parallelize reads first, and
then work from that (most workloads would probably benefit more
from parallelized reads anyway).





Re: Multi Core Support for compression in compression.c

2014-07-29 Thread Austin S Hemmelgarn
On 2014-07-29 13:08, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 2:36 PM, Nick Krause xerofo...@gmail.com wrote:
 On Mon, Jul 28, 2014 at 12:19 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 2014-07-28 11:57, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 11:13 AM, Nick Krause xerofo...@gmail.com
 wrote:
 On Mon, Jul 28, 2014 at 6:10 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems
 to be only using one core to compress. Depending on the
 CPU used and the amount of cores in the CPU we can make
 this much faster with multiple cores. This seems bad by
 my reading at least I would recommend for writing
 compression we write a function to use a certain amount
 of cores based on the load of the system's CPU not using
 more then 75% of the system's CPU resources as my system
 when idle has never needed more then one core of my i5
 2500k to run when with interrupts for opening eclipse are
 running. For reading compression on good core seems fine
 to me as testing other compression software for reads ,
 it's way less CPU intensive. Cheers Nick
 We would probably get a bigger benefit from taking an
 approach like SquashFS has recently added, that is,
 allowing multi-threaded decompression fro reads, and
 decompressing directly into the pagecache. Such an approach
 would likely make zlib compression much more scalable on
 large systems.



 Austin, That seems better then my idea as you seem to be more
 up to date on brtfs devolopment. If you and the other
 developers of brtfs are interested in adding this as a
 feature please let me known as I would like to help improve
 brtfs as the file system as an idea is great just seems like
 it needs a lot of work :). Nick
 I wouldn't say that I am a BTRFS developer (power user maybe?),
 but I would definitely say that parallelizing compression on
 writes would be a good idea too (especially for things like
 lz4, which IIRC is either in 3.16 or in the queue for 3.17).
 Both options would be a lot of work, but almost any performance
 optimization would.  I would almost say that it would provide a
 bigger performance improvement to get BTRFS to intelligently
 stripe reads and writes (at the moment, any given worker thread
 only dispatches one write or read to a single device at a
 time, and any given write() or read() syscall gets handled by
 only one worker).


 I will look into this idea and see if I can do this for writes.
 Regards Nick

 Austin, Seems since we don't want to release the cache for inodes
 in order to improve writes if are going to use the page cache. We
 seem to be doing this for writes in end_compressed_bio_write for
 standard pages and in end_compressed_bio_write. If we want to cache
 write pages why are we removing then ? Seems like this needs to be
 removed in order to start off. Regards Nick

 I'm not entirely sure, it's been a while since I went exploring in the
 page-cache code.  My guess is that there is some reason that you and I
 aren't seeing that we are trying for write-around semantics, maybe one
 of the people who originally wrote this code could weigh in?  Part of
 this might be to do with the fact that normal page-cache semantics
 don't always work as expected with COW filesystems (cause a write goes
 to a different block on the device than a read before the write would
 have gone to).  It might be easier to parallelize reads first, and
 then work from that (and most workloads would probably benefit more
 from the parallelized reads).

 I will look into this later today and work on it then.
 Regards Nick
 
 Seems the best way to do this is to create a kernel thread per core, like in NFS,
 and use these threads depending on the load of the system.
 Regards Nick
 
It might be more work now, but it would probably be better in the long
run to do it using kernel workqueues, as they would provide better
support for suspend/hibernate/resume, and then you wouldn't need to
worry about scheduling or how many CPU cores are in the system.





Re: Btrfs offline deduplication

2014-08-01 Thread Austin S Hemmelgarn
On 07/31/2014 07:54 PM, Timofey Titovets wrote:
 Good time of day.
 I have several questions about data deduplication on btrfs.
 Sorry if i ask stupid questions or waste you time %)
 
 What about implementation of offline data deduplication? I don't see
 any activity on this place, may be i need to ask a particular person?
 Where the problem? May be a can i try to help (testing as example)?
 
 I could be wrong, but as i understand btrfs store crc32 checksum one
 per file, if this is true, may be make a sense to create small worker
 for dedup files? Like worker for autodefrag?
 With simple logic like:
 if sum1 == sum2  file_size1 == file_size2; then
 if (bit_to_bit_identical(file1,2)); then merge(file1, file2);
 This can be first attempt to implement per file offline dedup
 What you think about it? could i be wrong? or this is a horrible crutch?
 (as i understand it not change format of fs)
 
 (bedup and other tools, its cool, but have several problem with these
 tools and i think, what kernel implementation can work better).
 
I think there may be some misunderstandings here about some of the
internals of BTRFS.  First of all, checksums are stored per block, not
per file, and secondly, deduplication can be done on a much finer scale
than individual files (you can deduplicate individual extents).

I do think however that having the option of a background thread doing
deduplication asynchronously is a good idea, but then you would have to
have some way to trigger it on individual files/trees, and triggering on
writes like the autodefrag thread does doesn't make much sense.  Having
some userspace program to tell it to run on a given set of files would
probably be the best approach for a trigger.  I don't remember if this
kind of thing was also included in the online deduplication patches that
got posted a while back or not.
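
As a point of comparison, the userspace tools mentioned later in this
thread already provide roughly that kind of trigger; a sketch with
duperemove (treat the exact flags as illustrative):

  # scan a directory tree for duplicate extents and deduplicate them
  duperemove -d -r /mnt/data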





Re: Btrfs offline deduplication

2014-08-01 Thread Austin S Hemmelgarn
On 08/01/2014 02:55 PM, Mark Fasheh wrote:
 On Fri, Aug 01, 2014 at 10:16:08AM -0400, Austin S Hemmelgarn wrote:
 On 2014-08-01 09:23, David Sterba wrote:
 On Fri, Aug 01, 2014 at 06:17:44AM -0400, Austin S Hemmelgarn wrote:
 I do think however that having the option of a background thread doing
 deduplication asynchronously is a good idea, but then you would have to
 have some way to trigger it on individual files/trees, and triggering on
 writes like the autodefrag thread does doesn't make much sense.  Having
 some userspace program to tell it to run on a given set of files would
 probably be the best approach for a trigger.  I don't remember if this
 kind of thing was also included in the online deduplication patches that
 got posted a while back or not.

 IIRC the proposed implementation only merged new writes with existing
 data.

 For the out-of-band (off-line) dedup there's bedup
 (https://github.com/g2p/bedup) or Mark's duperemove tool
 (https://github.com/markfasheh/duperemove) that work on a set of files.

 Something kernel-side to do the work asynchronously would be nice,
 especially if it could leverage the check-sums that BTRFS already stores
 for the blocks.  Having a userspace interface for offline deduplication
 similar to that for scrub operations would even better.
 
 Why does this have to be kernel side? There's userspace software already to
 dedupe that can be run on a regular basis. Exporting checksums is a
 differnet story (you can do that via ioctl) but running the dedupe software
 itself inside the kernel is exactly what we want to avoid by having the
 dedupe ioctl in the first place.
   --Mark
 
 --
 Mark Fasheh
 
Based on the same logic, however, we don't need scrub to be done kernel
side, as it would take only one more ioctl to be able to tell it which
block out of a set to treat as valid.  I'm not saying that things need
to be done in the kernel, but duperemove doesn't use the ioctl interface
even though it exists, bedup is buggy as hell (unless it's improved
greatly in the last two weeks), and neither of them is at all efficient.
I do understand that this isn't something that is computationally
simple (especially on x86 with its shortage of registers), but rsync
does almost the same thing for data transmission over the network, and
it does so seemingly much more efficiently than either option available
at the moment.





Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 09:17, Peter Waller wrote:
 For anyone else having this problem, this article is fairly useful for
 understanding disk full problems and rebalance:
 
 http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
 
 It actually covers the problem that I had, which is that a rebalance
 can't take place because it is full.
 
 I still am unsure what is really wrong with this whole situation. Is
 it that I wasn't careful to do a rebalance when I should have been
 doing? Is it that BTRFS doesn't do a rebalance automatically when it
 could in principle?
 
 It's pretty bad to end up in a situation (with spare space) where the
 only way out is to add more storage, which may be impractical,
 difficult or expensive.
I really disagree with the statement that adding more storage is
difficult or expensive: all you need to do is plug in a 2G USB flash
drive, or allocate a ramdisk, and add the device to the filesystem only
long enough to do a full balance.
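
For reference, a rough sketch of the temporary-device trick using a
ramdisk (device names and sizes are placeholders; a USB stick works the
same way, just skip the brd lines):

modprobe brd rd_nr=1 rd_size=2097152      # creates /dev/ram0, 2 GiB (rd_size is in KiB)
btrfs device add /dev/ram0 /mnt/pool      # temporarily grow the filesystem
btrfs balance start /mnt/pool             # the balance now has somewhere to move chunks
btrfs device delete /dev/ram0 /mnt/pool   # migrate everything back off the ramdisk
rmmod brd

Just don't leave the ramdisk in the filesystem any longer than the
balance takes.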
 
 The other thing that I still don't understand I've seen repeated in a
 few places, from the above article:
 
 because the filesystem is only 55% full, I can ask balance to rewrite
 all chunks that are more than 55% full
 
 Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I
 don't understand the relationship between the FS is 55% full and
 chunks more than 55% full. What's going on here?
To understand this, you have to understand that BTRFS uses a two-level
allocation scheme.  At the top level you have chunks, which are
contiguous regions of the disk that get used for storing a specific
block type; data chunks default to 1G in size, and metadata chunks
default to 256M.  When a filesystem is created, you get the minimum
number of chunks of each type based on the replication profiles chosen
for each chunk type; with no extra options, this means 1 data chunk and
2 metadata chunks for a single-disk filesystem.  Within each chunk,
BTRFS then allocates and frees individual blocks on demand; these blocks
are the analogue of blocks in most other filesystems.  When there are no
free blocks in any chunk of a given type, BTRFS allocates new chunks of
that type based on the replication profile.  Unlike blocks, however,
chunks aren't freed automatically (there are good reasons for this
behavior, but they are kind of long to explain here).  This is where
balance comes in: it takes all of the blocks in the filesystem and sends
them back through the block allocator.  This usually causes all of the
free blocks to end up in a single chunk, and frees the unneeded chunks.

When someone talks about a chunk being x% full, they mean that x% of the
space in that chunk is used by allocated blocks.  Talking about how full
the filesystem is can get tricky because of the replication profiles,
but the usual consensus is to treat that as the percentage of the
filesystem that contains blocks that are being used.

It should say LESS than 55% full in the various articles, as the
-dusage=x option tells balance to only consider chunks that are less
than x% full for balancing.  In general, if your filesystem is totally
full, you should use numbers starting with 0 and work your way up from
there.  You may even get lucky: using -dusage=0 -musage=0 may free up
enough chunks that you don't need to add more storage.
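
As a concrete illustration (the mount point is a placeholder), digging
out of a completely full filesystem usually looks something like this:

btrfs balance start -dusage=0 -musage=0 /mnt     # free completely empty chunks first
btrfs balance start -dusage=10 -musage=10 /mnt   # then progressively fuller ones
btrfs balance start -dusage=25 -musage=25 /mnt
btrfs fi df /mnt                                 # check what got reclaimed after each pass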
 
 I conclude that now since I have added more storage, the rebalance
 won't fail and if I keep rebalancing from a cron job I won't hit this
 problem again (unless the filesystem fills up very fast! what then?).
 I don't know however what value to assign to `-dusage` in general for
 the cron rebalance. Any hints?
I've found that something between 25 and 50 tends to do well; much
outside of that range and you start to get diminishing returns.  The
exact value tends to be more a matter of personal preference; I use 25
on most of my systems, because I don't like saturating the disks with
I/O for very long.  Do make sure, however, to add -musage=x as well, as
metadata should also be balanced (especially if you have very large
numbers of small files).
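
For the cron job itself, a minimal sketch (the script path and mount
point are placeholders; adjust the percentages to taste):

#!/bin/sh
# e.g. /etc/cron.weekly/btrfs-balance
btrfs balance start -dusage=25 -musage=25 /mnt/pool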
 






Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 10:11, Peter Waller wrote:
 On 4 August 2014 15:02, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 I really disagree with the statement that adding more storage is
 difficult or expensive, all you need to do is plug in a 2G USB flash
 drive, or allocate a ramdisk, and add the device to the filesystem only
 long enough to do a full balance.
 
 What if the machine is a server in a datacenter you don't have
 physical access to and the problem is an emergency preventing your
 users from being able to get work done?
 
 What happens if you use a RAM disk and there is a power failure?
 
I'm not saying that either option is a perfect solution.  In fact, the
only reason that I even mentioned the ramdisk is because I have had good
success with that on my laptop, but then laptops essentially have a
built-in UPS.  I personally wouldn't use a ramdisk except as a last
resort if you don't have some sort of UPS or redundancy in the PSU.






Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 06:31, Peter Waller wrote:
 Thanks Hugo, this is the most informative e-mail yet! (more inline)
 
 On 4 August 2014 11:22, Hugo Mills h...@carfax.org.uk wrote:

  * btrfs fi show
 - look at the total and used values. If used < total, you're OK.
   If used == total, then you could potentially hit ENOSPC.
 
 Another thing which is unclear and undocumented anywhere I can find is
 what the meaning of `btrfs fi show` is.
 
 I'm sure it is totally obvious if you are a developer or if you have
 used it for long enough. But it isn't covered in the manpage, nor in
 the oracle documentation, nor anywhere on the wiki that I could find.
 
You didn't look very hard then, because there is information in the
manpage (oh wait, you mentioned Oracle, so you're probably using RHEL or
CentOS, which are the last things you should be using if you want to use
stuff like BTRFS that is under heavy development), and it is documented
on the wiki.
 When I looked at it in my problematic situation, it said 500 GiB /
 500 GiB. That sounded fine to me because I interpreted the output as
 what fraction of which RAID devices BTRFS was using. In other words, I
 thought Oh, BTRFS will just make use of the whole device that's
 available to it.. I thought that `btrfs fi df` was the source of
 information for how much space was free inside of that.
 
  * btrfs fi df
 - look at metadata used vs total. If these are close to zero (on
   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
   ENOSPC.
 
 Hmm. It's unfortunate that this could indicate an amount of space
 which is free when it actually isn't.
That depends on what you mean by 'free'.
 
 - look at data used vs total. If the used is much smaller than
   total, you can reclaim some of the allocation with a filtered
   balance (btrfs balance start -dusage=5), which will then give
   you unallocated space again (see the btrfs fi show test).
 
 So the filtered balance didn't help in my situation. I understand it's
 something to do with the 5 parameter. But I do not understand what
 the impact of changing this parameter is. It is something to do with a
 fraction of something, but those things are still not present in my
 mental model despite a large amount of reading. Is there an
 illustration which could clear this up?
 
Think of each chunk like a box, and each block as a block, and that you
have two different types of block (data and metadata) and two different
types of box (also data and metadata). The data boxes are four times the
size of the metadata boxes, and they all have to fit in one really big
container (the device itself).  You can only put data blocks in the data
boxes, and you can only put metadata blocks in metadata boxes.  Say that
in total, you can fit 128 data boxes in the large container, or you can
replace one data box with up to four metadata boxes.  Even though you
may only have a few blocks in a given box, the box still takes up the
same amount of space in the larger container.  Thus, it's possible to
have only a few blocks stored, but not be able to add any more boxes to
the larger container.  A balance operation is essentially the equivalent
of taking all of the blocks of a given type, and fitting them into the
smallest number of boxes possible.
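
To tie the analogy back to the tools (the mount point is a placeholder):

btrfs filesystem show /mnt           # per device: used == total means every box slot is allocated
btrfs filesystem df /mnt             # per type: used far below total means the boxes are mostly empty
btrfs balance start -dusage=5 /mnt   # repack nearly-empty data boxes so whole chunks can be freed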
 Among other things I also got the kernel stack trace I pasted at the
 bottom of the first e-mail to this thread when I did the rebalance.
 
This FAQ entry is pretty horrible, I'm afraid. I actually started
 rewriting it here to try to make it clearer what's going on. I'll try
 to work on it a bit more this week and put out a better version for
 the wiki.
 
 This is great to hear! :)
 
 Thanks for your response Hugo, that really cleared up a lot of mental
 model problems. I hope the documentation can be improved so that
 others can learn from my mistakes.
 






Re: ENOSPC with mkdir and rename

2014-08-05 Thread Austin S Hemmelgarn
On 2014-08-05 04:20, Duncan wrote:
 Austin S Hemmelgarn posted on Mon, 04 Aug 2014 13:09:23 -0400 as
 excerpted:
 
 Think of each chunk like a box, and each block as a block, and that you
 have two different types of block (data and metadata) and two different
 types of box (also data and metadata). The data boxes are four times the
 size of the metadata boxes, and they all have to fit in one really big
 container (the device itself).  You can only put data blocks in the data
 boxs, and you can only put metadata blocks in metadata boxes.  Say that
 in total, you can fit 128 data boxes in the large container, or you can
 replace one data box with up to four metadata boxes.  Even though you
 may only have a few blocks in a given box, the box still takes up the
 same amount of space in the larger container.  Thus, it's possible to
 have only a few blocks stored, but not be able to add any more boxes to
 the larger container.  A balance operation is essentially the equivalent
 of taking all of the blocks of a given type, and fitting them into the
 smallest number of boxes possible.
 
 FWIW, that's a great analogy to stick up on the wiki somewhere, probably 
 somewhere in the FAQ related to ENOSPC.  Please consider doing so.
 
 (Someone took one of my explanations from the list and stuck it in the 
 wiki, virtually word-for-word, with a link to the list post in the 
 archives for more.  I was glad, as for some reason I just seem to work 
 best on the lists, and seem to treat web pages as read-only, even if 
 they're on a wiki I in theory have or can get write-privs on.  I'm 
 suggesting someone, doesn't have to be you tho great if it is, do the 
 same with this.)
 
I would love to have it up on the wiki, but don't have an account or
write privileges.  FWIW, I consider anything I post on a mailing list
that isn't marked otherwise (except patches) to be public domain, so
everyone feel free to use it however you want.





Re: Ideas for a feature implementation

2014-08-10 Thread Austin S Hemmelgarn
On 08/10/2014 03:21 PM, Vimal A R wrote:
 Hello,
 
 I came across the to-do list at 
 https://btrfs.wiki.kernel.org/index.php/Project_ideas and would like to know 
 if this list is updated and recent.
 
 I am looking for a project idea for my under graduate degree which can be 
 completed in around 3-4 months. Are there any suggestions and ideas to help 
 me further?
 
 Thank you,
 Vimal
It's not really listed there (though some of the projects there might be
considered subsets of it), but improved parallelization for multi-device
setups is one thing that I know a lot of people would like to see.

Another thing that isn't listed there, but that I would personally love
to see, is support for secure file deletion.  To be truly secure though,
this would need to hook into the COW logic so that files marked for
secure deletion can't be reflinked (maybe make them automatically NOCOW
instead, and don't allow snapshots?), and so that when they get written
to, the blocks that get COW'ed have the old block overwritten.



Re: Ideas for a feature implementation

2014-08-11 Thread Austin S Hemmelgarn
On 08/11/2014 04:27 PM, Chris Murphy wrote:
 
 On Aug 10, 2014, at 8:53 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 
 
 Another thing that isn't listed there, that I would personally
 love to see is support for secure file deletion.  To be truly
 secure though, this would need to hook into the COW logic so that
 files marked for secure deletion can't be reflinked (maybe make
 the automatically NOCOW instead, and don't allow snapshots?), and
 when they get written to, the blocks that get COW'ed have the old
 block overwritten.
 
 If the file is reflinked or snapshot, then it can it be secure
 deleted? Because what does it mean to secure delete a file when
 there's a completely independent file pointing to the same physical
 blocks? What if someone else owns that independent file? Does the
 reflink copy get rm'd as well? Or does the file remain, but its
 blocks are zero'd/corrupted?
The semantics that I would expect would be that the extents can't be
reflinked, and when snapshotted the whole file gets COW'ed, and then
inherits the secure deletion flag, possibly with another flag saying
that the user can't disable the secure deletion flag.
 
 For SSDs, whether it's an overwrite or an FITRIM ioctl it's an open
 question when the data is actually irretrievable. It may be
 seconds, but could be much longer (hours?) so I'm not sure if it's
 useful. On HDD's using SMR it's not necessarily a given an
 overwrite will work there either.
By secure deletion, I don't mean make the data absolutely
unrecoverable by any means, I mean make it functionally impractical
for someone without low-level access to and/or extensive knowledge of
the hardware to recover the data; that is, more secure than simply
unlinking the file, but obviously less than (for example) the
application of thermite to the disk platters.  I'm talking the rough
equivalent of wiping the data from RAM.

Anyone who is truly security minded should be using whole disk
encryption anyway, but even then you have the data accessible from the
running OS.


Re: Ideas for a feature implementation

2014-08-12 Thread Austin S Hemmelgarn
On 2014-08-12 11:52, David Pottage wrote:
 
 On 11/08/14 03:53, Austin S Hemmelgarn wrote:
 
 Another thing that isn't listed there, that I would personally love to
 see is support for secure file deletion.  To be truly secure though,
 this would need to hook into the COW logic so that files marked for
 secure deletion can't be reflinked (maybe make the automatically NOCOW
 instead, and don't allow snapshots?), and when they get written to, the
 blocks that get COW'ed have the old block overwritten.
 How would secure deletion interact with file de-duplication?
 
 For example suppose you and I are both users on a multi user system. We
 both obtain copies of the same file independently, and save that file to
 our home directories.
 
 A background process notices that both files are the same and
 de-duplicates them. This means that both your file and mine point to the
 same blocks on disc. This is exactly the same as would happen if you
 made a COW copy of your file, transferred ownership to me, and I moved
 it into my home dir.
 
 You then decide to secure delete your copy of the file. What happens to
 mine? If it gets removed, then you have just deleted a file you don't
 own, if it does not then the file-system has broken the contract to
 secure delete a file when you asked it to.
 
 Also, what happens if the two files have similar portions, but they are
 not identical. For example, if you download and ISO image for ubuntu,
 and I download the ISO for kubuntu (at the same version). There will be
 a lot of sections that are the same, because they will contain a lot of
 packages in common, so there will be large gains in de-duplicating the
 similar parts, but most people would consider the files to be different.
 
 Could this mean that if you secure delete your ubuntu iso, then portions
 of my kubuntu iso might become corrupt?
 
You could work around this by marking the extent instead of the file
(marking a file would mark all of its extents), and then checking for
that marking when the extent is freed (i.e., nobody refers to it anymore).
While this approach might not seem useful to most people, there are
practical use cases for it (even without whole disk encryption).
It would actually be pretty easy to integrate this globally for a
filesystem as a mount option.
 Even if we limit secure delete to root, then we still leave the risk of
 unintentonaly breaking user files, because non-one realised that all or
 part of the file appears in other files via de-duplication. In any case
 if secure delete is limited to root, then most people would not find it
 useful. (or they would use sudo to do it, which brings us back to the
 same problems).
 
 Basically, I think that file secure deletion as a concept is not
 compatible with a 5th generation file system. If you relay want to
 securely remove a file, then copy the stuff you need elsewhere, and put
 the disc in the crusher. Alternatively put the filesystem in an encypted
 container, and then reformat the disc with a different encryption key.
 
While I agree that the traditional notion of secure deletion doesn't fit
in the current generation of file systems, there is still a need for COW
filesystems to be able to prevent sensitive data from being exposed at
run-time.  On any current BTRFS filesystem, it is still possible to find
blocks that have been COW'ed (assuming discard is turned off) and have
no referents, possibly long after the block itself is freed, especially
if the volume is much larger than the stored data set (as is the case
for a large majority of desktop users these days) or the workload is not
write intensive.





Re: Large files, nodatacow and fragmentation

2014-08-14 Thread Austin S Hemmelgarn
On 2014-08-14 10:30, G. Richard Bellamy wrote:
 On Wed, Aug 13, 2014 at 9:23 PM, Chris Murphy li...@colorremedies.com wrote:
 lsattr /var/lib/libvirt/images/atlas.qcow2

 Is the xattr actually in place on that file?
 
 2014-08-14 07:07:36
 $ filefrag /var/lib/libvirt/images/atlas.qcow2
 /var/lib/libvirt/images/atlas.qcow2: 46378 extents found
 2014-08-14 07:08:34
 $ lsattr /var/lib/libvirt/images/atlas.qcow2
 ---C /var/lib/libvirt/images/atlas.qcow2
 
 So, yeah, the attribute is set.
 

 It will fragment somewhat but I can't say that I've seen this much 
 fragmentation with xattr C applied to qcow2. What's the workload? How was 
 the qcow2 created? I recommend -o 
 preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My 
 workloads were rather simplistic: OS installs and reinstalls. What's the 
 filesystem being used in the guest that's using the qcow2 as backing?
 
 When I created the file, I definitely preallocated the metadata, but
 did not set compat or lazy_refcounts. However, isn't that more a
 function of how qemu + KVM managed the image, rather than how btrfs?
 This is a p2v target, if that matters. Workload has been minimal since
 virtualizing because I have yet to get usable performance with this
 configuration. The filesystem in the guest is Win7 NTFS. I have seen
 massive thrashing of the underlying volume during VSS operations in
 the guest, if that signifies.
 

 It might be that your workload is best suited for a preallocated raw file 
 that inherits +C, or even possibly an LV.
 
 I'm close to that decision. As I mentioned, I much prefer the btrfs
 subvolume story over lvm, so moving to raw is probably more desirable
 than that... however, then I run into my lack of understanding of the
 difference between qcow2 and raw with respect to recoverability, e.g.
 does raw have the same ACID characteristics as a qcow2 image, or is
 atomicity a completely separate concern from the format? The ability
 for the owning process to recover from corruption or inconsistency is
 a key factor in deciding whether or not to turn COW off in btrfs - if
 your overlying system is capable of such recovery, like a database
 engine or (presumably) virtualization layer, then COW isn't a
 necessary function from the underlying system.
 
 So, just since I started this reply, you can see the difference in
 fragmentation:
 2014-08-14 07:25:04
 $ filefrag /var/lib/libvirt/images/atlas.qcow2
 /var/lib/libvirt/images/atlas.qcow2: 46461 extents found
 
 That's 17 minutes, an OS without interaction (I wasn't doing anything
 with it, but it may have been doing its own work like updates, etc.),
 and I see an fragmentation increase of 83 extents, and a raid10 volume
 that was beating itself up (I could hear the drives chattering away as
 they worked).
The fact that it is Windows using NTFS is probably part of the problem.
Here are some things you can do to decrease its background disk
utilization (these also improve performance on real hardware):
1. Disable system restore points.  These aren't really necessary if you
are running in a VM and can take snapshots from the host OS.
2. Disable the indexing service.  This does a lot of background disk IO,
and most people don't need the high speed search functionality.
3. Turn off Windows Features that you don't need.  This won't help disk
utilization much, but can greatly improve overall system performance.
4. Disable the paging file.  Windows does a lot of unnecessary
background paging, which can cause lots of unneeded disk IO.  Be careful
doing this however, as it may cause problems for memory hungry applications.
5. See if you can disable boot time services you don't need.  Bluetooth,
SmartCard, and Adaptive Screen Brightness are all things you probably
don't need in a VM environment.

Of these, 1, 2, and 4 will probably help the most.  The other thing is
that NTFS is a journaling file system, and putting a journaled file
system image on a COW backing store will always cause some degree of
thrashing, because the same few hundred MB of the disk get rewritten
over and over again.  The only way to work around that on BTRFS is to
make the file NOCOW, AND preallocate the entire file in one operation
(use the fallocate command from util-linux to do this).
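
A minimal sketch of setting that up for a new raw image (the file name
and size are placeholders; note that +C only takes effect while the file
is still empty):

touch /var/lib/libvirt/images/atlas.raw
chattr +C /var/lib/libvirt/images/atlas.raw         # mark it NOCOW before any data lands in it
fallocate -l 60G /var/lib/libvirt/images/atlas.raw  # preallocate the whole file in one operation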






Re: Questions on using BtrFS for fileserver

2014-08-19 Thread Austin S Hemmelgarn
On 2014-08-19 12:21, M G Berberich wrote:
 Hello,
 
 we are thinking about using BtrFS on standard hardware for a
 fileserver with about 50T (100T raw) of storage (25×4TByte).
 
 This is what I understood so far. Is this right?
 
 · incremental send/receive works.
 
 · There is no support for hotspares (spare disks that automatically
   replaces faulty disk).
 
 · BtrFS with RAID1 is fairly stable.
 
 · RAID 5/6 spreads all data over all devices, leading to performance
   problems on large diskarrays, and there is no option to limit the
   numbers of disk per stripe so far.
 
 Some questions:
 
 · There where reports, that bcache with btrfs leads to corruption. Is
   this still so?
Based on some testing I did last month, bcache with anything has the
potential to cause data corruption.
 
 · If a disk failes, does BtrFS rebalance automatically? (This would
   give a a kind o hotspare behavior)
No, but it wouldn't be hard to write a simple monitoring program to do
this from userspace.  IIRC, the big issue is that you need to add a
device in place of the failed one for the rebalance to work.
 
 · Besides using bcache, are there any possibilities to boost
   performance by adding (dedicated) cache-SSDs to a BtrFS?
Like mentioned in one of the other responses, I would suggest looking
into dm-cache.  BTRFS itself does not have any functionality for this,
although there has been talk of implementing device priorities for
reads, which could provide a similar performance boost.
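
For what it's worth, a rough sketch of what an lvmcache (dm-cache via
LVM) setup looks like, assuming a reasonably recent LVM and with the
volume group, LV names and sizes as placeholders:

lvcreate -n cache -L 100G vg /dev/ssd       # cache data LV on the SSD
lvcreate -n cache_meta -L 1G vg /dev/ssd    # cache metadata LV on the SSD
lvconvert --type cache-pool --poolmetadata vg/cache_meta vg/cache
lvconvert --type cache --cachepool vg/cache vg/bulk   # attach the pool to the big LV
mkfs.btrfs /dev/vg/bulk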
 
 · Are there any reports/papers/web-pages about BtrFS-systems this size
   in use? Praises, complains, performance-reviews, whatever…
While it doesn't quite fit the description, I have had very good success
with a very active 2TB BTRFS RAID10 filesystem built from four
unpartitioned 1TB SATA III hard drives.  The filesystem gets in excess
of 100GB of data written to it each day (almost all rewrites, however),
and is what I use for /home, /var/log, and /var/lib.  I've had no issues
with it that were caused by BTRFS, and in fact BTRFS itself helped me
recover data when the storage controller the drives are connected to
went bad.  On average, I get about 125% of raw disk performance on
writes, and about 110% on reads.

If you are using a very large number of disks, then I would not suggest
that you use BTRFS RAID10, but instead BTRFS RAID1, as RAID10 will try
to stripe things across ALL of the devices in the filesystem.  Unless
each storage controller has no more than about four disks attached to
it, the overhead outweighs the benefit of striping the data.
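
For example, a sketch of setting that up (device names are placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sd[b-z]
# or, to convert an existing filesystem in place:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt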

Also, just to make sure it's clear, in BTRFS RAID1, each block gets
written EXACTLY twice.  On the plus side though, this means that if you
do set-up a caching mechanism, you may be able to keep most of the array
spun down a majority of the time.





Re: Questions on using BtrFS for fileserver

2014-08-20 Thread Austin S Hemmelgarn
On 08/19/2014 05:38 PM, Andrej Manduch wrote:
 Hi,
 
 On 08/19/2014 06:21 PM, M G Berberich wrote: · Are there any
 reports/papers/web-pages about BtrFS-systems this size
   in use? Praises, complains, performance-reviews, whatever…
 
 I don't know about papers or benchmarks but few weeks ago there was a
 guy who has problem with really long mounting with btrfs with similiar size.
 https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36226.html
 
 And I would not recommend 3TB disks. *I'm not btrfs dev* but as far as I
 know there is a quite different between rebuilding disk on real RAID and
 btrfs RAID. The problem is btrfs has RAID on filesystem level not on hw
 level so there is bigger mechanical overheat on drives and thus it take
 significantli longer than regular RAID.
It really surprises me that so many people come to this conclusion, but
maybe they don't provide as much slack space as I do on my systems.  In
general you will only have a longer rebuild on BTRFS than on hardware
RAID if the filesystem is more than about 50% full.  On my desktop array
(4x 1TB disks using BTRFS RAID10), I've replaced disks before and it
took less than an hour for the operation.  Of course that array is
usually not more than 10% full.  Interestingly, it took less time to
rebuild this array the last time I lost a disk than it did back when it
was 3x 1TB disks in a BTRFS RAID1, so things might improve overall with
a larger number of disks in the array.


Re: Significance of high number of mails on this list?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-20 23:22, Shriramana Sharma wrote:
 Hello. People on this list have been kind enough to reply to my
 technical questions. However, seeing the high number of mails on this
 list, esp with the title PATCH, I have a question about the
 development itself:
 
 Is this just an indication of a vibrant user/devel community [*] and
 healthy development of many new nice features to eventually come out
 in stable form later, or are we still at the fixing rough edges stage?
 IOW what is the proportion of commits adding new features to those
 stabilising/fixing features?
 
 [* Since there is no separate btrfs-users vs brtfs-dev I'm not able to
 gauge this difference either. i.e. if there were a dedicated -dev list
 I might not be alarmed by a high number of mails indicating fast
 development.]
 
 Mostly I have read like BTRFS is mostly stable but there might be a
 few corner cases as yet unknown since this is a totally new generation
 of FSs. But still given the volume of mails here I wanted to ask...
 I'm sorry I realize I'm being a bit vague but I'm not sure how to
 exactly express what I'm feeling about BTRFS right now...
 
Personally I'd say that BTRFS is 'stable' enough for light usage without
using stuff like quotas or RAID5/6.  So far, having used it since 3.10,
I've only once had a filesystem get corrupted when there wasn't some
serious underlying hardware issue (crashed disk, SATA controller
dropping random single sectors from writes, etc.), and it gives me much
better performance than what I previously used (ext4 on top of LVM).
As far as what to make of the volume of patches on the mailing list, I'd
say that that shouldn't be used as a measure of quality.  The ext4
mailing list is almost as busy on a regular basis, and people have been
using that in production for years, and the XFS mailing list gets much
higher volume of patches from time to time, and it's generally
considered the gold standard of a stable filesystem.





Re: Distro vs latest kernel for BTRFS?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-22 07:59, Shriramana Sharma wrote:
 Hello. I've seen repeated advices to use the latest kernel. While
 hearing of the recent compression bug affecting recent kernels does
 somewhat warn one off the previous advice, I would like to know what
 people who are running regular distros do to get the latest kernel.
 
 Personally I'm on Kubuntu, which provides mainline kernels till a
 particular point but not beyond that.
 
 Do people here always compile the latest kernel themselves just to get
 the latest BTRFS stability fixes (and  improvements, though as a
 second priority)?
 
I personally use Gentoo Unstable on all my systems, so I build all my
kernels locally anyway, and stay pretty much in-line with the current
stable Mainline kernel.
Interestingly, I haven't had any issues related to either of the
recently discovered bugs, despite meeting all of the criteria for being
affected by them.





Re: Distro vs latest kernel for BTRFS?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-22 14:22, Rich Freeman wrote:
 On Fri, Aug 22, 2014 at 8:04 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:

 I personally use Gentoo Unstable on all my systems, so I build all my
 kernels locally anyway, and stay pretty much in-line with the current
 stable Mainline kernel.
 
 Gentoo Unstable probably means gentoo-sources, testing version,
 which follows the stable kernel branch, but the most recent stable,
 and not the long-term stable.  gentoo-sources stable version generally
 follows the most recent longterm stable kernel (so 3.14 right now).
 I'm not sure what the exact policy is, but that is my sense of it.
 
 So, you're still running a stable kernel most likely.  If you really
 want mainline then you want git-sources.  That follows the most recent
 mainline I believe.  Of course, if you're following it that closely
 then you probably should think about just doing a git clone and
 managing it yourself, since then you can handle patches/etc more
 easily.
 
 I think the best option for somebody running btrfs is to stick with a
 stable kernel branch, either the current stable or a very recent
 longterm.  I wouldn't go back into 3.2 land or anything like that.
 
 But, yes, if you had stuck with 3.14 and not gone to the current
 stable then you would have missed the compress=lzo deadlock.  So, pick
 your poison.  :)
 
 Rich
 
By saying 'unstable' I'm referring to the stuff delimited in portage
with the ~ARCH keywords.  Personally, I wouldn't use that term myself
(all of my systems running on such packages have been rock-solid stable
from a software perspective), but that is how the official documentation
refers to things with the ~ARCH keywords.  There are a lot of Gentoo
users who don't know about the keyword thing other than as an occasional
inconvenience when emerging certain packages, so I just use the same
term as the documentation.

For the record, I am using the gentoo-sources package, but instead of
using what they mark as stable (which is 3.14), I'm using the most
recent version (which is 3.16.1).





Re: superblock checksum mismatch after crash, cannot mount

2014-08-25 Thread Austin S Hemmelgarn
On 2014-08-24 15:48, Chris Murphy wrote:
 
 On Aug 24, 2014, at 10:59 AM, Flash ROM flashromg...@yandex.com wrote:
 While it sounds dumb, this strange thing being done to put partition table 
 in separate erase block, so it never read-modify-written when FAT entries 
 are updated. Should something go wrong, FAR can recover from backup copy. 
 But erased partition table just suxx. Then, FAT tables are aligned in way to 
 fit well around erase block bounds.
 
 I think you seriously overestimate the knowledge of camera manufacturer's 
 about the details of flash storage; and any ability to discover it; and any 
 willingness on the part of the flash manufacturer to reveal such underlying 
 details. The whole point of these cards is to completely abstract the reality 
 of the underlying hardware from the application layer - in this case the 
 camera or mobile device using it.
 
If you really know what you are doing, it is possible to determine erase
block size by looking at device performance timings, with surprisingly
high accuracy (assuming you aren't trying to have software do it for
you).  I've actually done this before on several occasions, with nearly
100% success.
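
A crude sketch of the timing approach (destructive, and the device name
is a placeholder):

# write with direct I/O at increasing block sizes; throughput usually jumps
# once the block size reaches a multiple of the erase block size
for bs in 64K 128K 256K 512K 1M 2M 4M 8M; do
    printf '%s: ' "$bs"
    dd if=/dev/zero of=/dev/sdX bs=$bs count=64 oflag=direct 2>&1 | tail -n 1
done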
 Also, with SDXC exFAT is now specified. And it has only one FAT there isn't a 
 backup FAT. So they're even more difficult to recover data from should things 
 go awry filesystem wise.
 
It's too bad that TFAT didn't catch on, as it would have been great for
SD cards if it could be configured to put each FAT on a different erase
block.
 
 This said, you can *try* to reformat, BUT no standard OS of firmware 
 formatter will help you with default settings. They can't know geometry of 
 underlying NAND and controller properties. There is no standard, widely 
 accepted way to get such information from card. No matter if you use OS 
 formatter, camera formatter or whatever. YOU WILL RUIN factory format (which 
 is crafted in best possible way) and replace it with another, very likely 
 suboptimal one.
 
 It's recommended by the card manufacturers to reformat it in each camera its 
 inserted into. It's the only recommended way to erase the sd card for 
 re-use, they don't recommend selectively deleting images. And it's known that 
 one camera's partition table and formatting can irritate another camera 
 make/model if the card isn't reformatted by that camera.
 
It's not just cameras that have this issue, a lot of other hardware
makes stupid assumptions about the format of media.  The first firmware
release for the Nintendo Wii for example, chocked if you tried to use an
SD card with more than one partition on it, and old desktop versions of
Windows won't ever show you anything other than the first partition on
an SD card (or most USB storage devices for that matter).






Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread Austin S Hemmelgarn
I wholeheartedly agree.  Of course, getting something other than CFQ as
the default I/O scheduler is going to be a difficult task.  Enough
people upstream are convinced that we all NEED I/O priorities, when most
of what I see people doing with them is bandwidth provisioning, which
can be done much more accurately (and flexibly) using cgroups.

Ironically, there have been a lot of in-kernel defaults that I have run
into issues with recently, most of which originated in the DOS era,
where a few MB of RAM was high-end.
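
For reference, the sorts of defaults discussed below can all be changed
at runtime; a sketch (device names are placeholders):

blockdev --setra 1024 /dev/sdd                    # bigger readahead
echo 1024 > /sys/block/sdd/queue/max_sectors_kb   # allow larger requests
echo deadline > /sys/block/sdd/queue/scheduler    # swap out CFQ
echo 4096 > /sys/block/md0/md/stripe_cache_size   # bigger RAID5 stripe cache (md arrays only)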

On 2014-09-02 08:55, Zack Coffey wrote:
 While I'm sure some of those settings were selected with good reason,
 maybe there can be a few options (2 or 3) that have some basic
 intelligence at creation to pick a more sane option.
 
 Some checks to see if an option or two might be better suited for the
 fs. Like the RAID5 stripe size. Leave the default as is, but maybe a
 quick speed test to automatically choose from a handful of the most
 common values. If they fail or nothing better is found, then apply the
 default value just like it would now.
 
 
 On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig h...@infradead.org wrote:
 On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
 Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
 and XFS are doing is doing 128k IOs because that's the default block
 device readahead size.  'blockdev --setra 1024 /dev/sdd' before
 mounting the filesystem will probably fix it.

 Btw, it's really getting time to make Linux storage fs work out the
 box.  There's way to many things that are stupid by default and we
 require everyone to fix up manually:

  - the ridiculously low max_sectors default
  - the very small max readahead size
  - replacing cfq with deadline (or noop)
  - the too small RAID5 stripe cache size

 and probably a few I forgot about.  It's time to make things perform
 well out of the box..
 






Re: Large files, nodatacow and fragmentation

2014-09-02 Thread Austin S Hemmelgarn
On 2014-09-02 14:31, G. Richard Bellamy wrote:
 I thought I'd follow-up and give everyone an update, in case anyone
 had further interest.
 
 I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
 bcache front device.
 
 It's 5x600GB SAS 15k RPM drives RAID10, with the 512MB SSD bcache.
 
 2014-09-02 11:23:16
 root@eanna i /var/lib/libvirt/images # lsblk
 NAME  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
 sda 8:00 558.9G  0 disk
 └─bcache3 254:30 558.9G  0 disk /var/lib/btrfs/data
 sdb 8:16   0 558.9G  0 disk
 └─bcache2 254:20 558.9G  0 disk
 sdc 8:32   0 558.9G  0 disk
 └─bcache1 254:10 558.9G  0 disk
 sdd 8:48   0 558.9G  0 disk
 └─bcache0 254:00 558.9G  0 disk
 sde 8:64   0 558.9G  0 disk
 └─bcache4 254:40 558.9G  0 disk
 sdf 8:80   0   1.8T  0 disk
 └─sdf1  8:81   0   1.8T  0 part
 sdg 8:96   0   477G  0 disk /var/lib/btrfs/system
 sdh 8:112  0   477G  0 disk
 sdi 8:128  0   477G  0 disk
 ├─bcache0 254:00 558.9G  0 disk
 ├─bcache1 254:10 558.9G  0 disk
 ├─bcache2 254:20 558.9G  0 disk
 ├─bcache3 254:30 558.9G  0 disk /var/lib/btrfs/data
 └─bcache4 254:40 558.9G  0 disk
 sr011:01  1024M  0 rom
 
 I further split the system and data drives of the VM Win7 guest. It's
 very interesting to see the huge level of fragmentation I'm seeing,
 even with the help of ordered writes offered by bcache - in other
 words while bcache seems to be offering me stability and better
 behavior to the guest, the underlying the filesystem is still seeing a
 level of fragmentation that has me scratching my head.
 
 That being said, I don't know what would be normal fragmentation of a
 VM Win7 guest system drive, so could be I'm just operating in my zone
 of ignorance again.
 
 2014-09-01 14:41:19
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 7 extents found
 atlas-system.qcow2: 154 extents found
 2014-09-01 18:12:27
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 28171 extents found
 2014-09-02 08:22:00
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 35281 extents found
 2014-09-02 08:44:43
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 37203 extents found
 2014-09-02 10:14:32
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 40903 extents found
 
This may sound odd, but are you exposing the disk to the Win7 guest as a
non-rotational device?  Win7 and higher tend to have different write
behavior when they think they are on an SSD (or something else where
seek latency is effectively 0).  Most VMMs (at least, most that I've
seen) will use fallocate to punch holes for ranges that get TRIM'ed in
the guest, so if Windows is sending TRIM commands, that may also be part
of the issue.  Also, you might try reducing the amount of logging in the
guest.





Re: No space on empty, degraded raid10

2014-09-08 Thread Austin S Hemmelgarn
On 2014-09-07 16:38, Or Tal wrote:
 Hi,
 
 I've created a new raid10 array from 4, 4TB drives in order to migrate
 old data to it.
 As I didn't have enough sata ports, I:
 - disconnected one of the raid10 disks to free a sata port,
 - connected an old disk I wanted to migrate,
 - mounted the array with -o degraded
 - copied the data it it.
 
 After about 2MB I got a no space left on device message.
 btrfs fi df showed strange things - much less space in every category
 (about 8GB?) and none of then was full.
 
 Ubuntu 14.10 beta - linux 3.16.0-14
Yeah, RAID10 doesn't really work in degraded mode (even if you have two
disks that have stripes from the same copy).  The approach that would be
needed for what you want to do is:
 1. Make a BTRFS RAID1 filesystem with _3_ new drives
 2. Connect one of the old disks
 3. Transfer data from old disk to new filesystem
 4. After repeating steps 2 and 3 for each old disk, connect the final
new disk, add it to the filesystem, and rebalance with '-dconvert=raid10
-mconvert=raid10'
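
A sketch of those steps (device names and mount points are placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/sdb /dev/sdc
mount /dev/sda /mnt/new
# for each old disk in turn: connect it, mount it, copy, then swap in the next
cp -a /mnt/old/. /mnt/new/
# once the last old disk is done, connect the fourth new drive:
btrfs device add /dev/sdd /mnt/new
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/new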

Also, I've found out the hard way that system chunks really should be
RAID1, _NOT_ RAID10, otherwise it's very likely that the filesystem
won't mount at all if you lose 2 disks.





Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 08:27, Bob Williams wrote:
 I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
 data and metadata. Last night I started
 
 # btrfs filesystem balance path
 
In general, unless things are really bad, you don't ever want to use
balance on such a big filesystem without some filters to control what
gets balanced (especially if the filesystem is more than about 50% full
most of the time).

My suggestion in this case would be to use:
# btrfs balance start -dusage=25 -musage=25 path
on a roughly weekly basis.  This will only balance chunks that are less
than 25% full, and therefore run much faster.  If you are particular
about high storage efficiency, then try 50 instead of 25.
 and it is still running 18 hours later. This suggests that most stuff
 only gets written to one physical device, which in turn suggests that
 there is a risk of lost data if one physical device fails. Or is there
 something clever about btrfs raid that I've missed? I've used linux
 software raid (mdraid) before, and it appeared to write to both
 devices simultaneously.
The reason that a full balance takes so long on a big (and I'm assuming
based on the 18 hours it's taken, very full) filesystem is that it reads
all of the data, and writes it out to both disks, but it doesn't do very
good load-balancing like mdraid or LVM do.  I've got a 4x 500GiB BTRFS
RAID10 filesystem that I use for my home directory on my desktop system,
and a full balance on that takes about 6 hours.
 
 Is it safe to interrupt [^Z] the btrfs balancing process?
^Z sends a SIGSTOP, which is a really bad idea with something that is
doing low-level stuff to a filesystem.  If you need to stop the balance
process (and are using a recent enough kernel and btrfs-progs), the
preferred way to do so is to use the following from another terminal:
# btrfs balance stop path
Depending on what the balance operation is working when you do this, it
may take a few minutes before it actually stops (the longest that I've
seen it take is ~200 seconds).
 
 As a rough guide, how often should one perform
 
 a) balance
 b) defragment
 c) scrub
 
 on a btrfs raid setup?
In general, you should be running scrub regularly, and balance and
defragment as needed.  On the BTRFS RAID filesystems that I have, I use
the following policy:
1) Run a 25% balance (the command I mentioned above) on a weekly basis.
2) If the filesystem has less than 50% of either the data or metadata
chunks full at the end of the month, run a full balance on it.
3) Run a scrub on a daily basis.
4) Defragment files only as needed (which isn't often for me because I
use the autodefrag mount option).
5) Make sure that only one of balance, scrub or defrag is running at a
given time.
Normally, you shouldn't need to run balance at all on most BTRFS
filesystems, unless your usage patterns vary widely over time (I'm
actually a good example of this, most of the files in my home directory
are relatively small, except for when I am building a system with
buildroot or compiling a kernel, and on occasion I have VM images that
I'm working with).





Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 09:48, Rich Freeman wrote:
 On Wed, Sep 10, 2014 at 9:06 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 Normally, you shouldn't need to run balance at all on most BTRFS
 filesystems, unless your usage patterns vary widely over time (I'm
 actually a good example of this, most of the files in my home directory
 are relatively small, except for when I am building a system with
 buildroot or compiling a kernel, and on occasion I have VM images that
 I'm working with).
 
 Tend to agree, but I do keep a close eye on free space.  If I get to
 the point where I'm over 90% allocated to chunks with lots of unused
 space otherwise I run a balance.  I tend to have the most problems
 with my root/OS filesystem running on a 64GB SSD, likely because it is
 so small.
 
 Is there a big performance penalty running mixed chunks on an SSD?  I
 believe this would get rid of the risk of ENOSPC issues if everything
 gets allocated to chunks.  There are obviously no issues with random
 access on an SSD, but there could be other problems (cache
 utilization, etc).
There shouldn't be any more performance penalty than for normally
running mixed chunks.  Also, a 64GB SSD is not small, I use a pair of
64GB SSD's in a BTRFS RAID1 configuration for root on my desktop, and
consistently use less than a quarter (12G on average) of the available
space, and that's with stuff like LibreOffice and the entire OpenClipart
distribution (although I'm not running an 'enterprise' distribution, and
keep /tmp and /var/tmp on tmpfs).
 
 I tend to watch btrfs fi sho and if the total space used starts
 getting high then I run a balance.  Usually I run with -dusage=30 or
 -dusage=50, but sometimes I get to the point where I just need to do a
 full balance.  Often it is helpful to run a series of balance commands
 starting at -dusage=10 and moving up in increments.  This at least
 prevents killing IO continuously for hours.  If we can get to a point
 where balancing can operate at low IO priority that would be helpful.
 
 IO priority is a problem in btrfs in general.  Even tasks run at idle
 scheduling priority can really block up a disk.  I've seen a lot of
 hurry-and-wait behavior in btrfs.  It seems like the initial commit to
 the log/etc is willing to accept a very large volume of data, and then
 when all the trees get updated the system grinds to a crawl trying to
 deal with all the data that was committed.  The problem is that you
 have two queues, with the second queue being rate-limiting but the
 first queue being the one that applies priority control.  What we
 really need is for the log to have controls on how much it accepts so
 that the updating of the trees/etc never is rate-limiting.   That will
 limit the ability to have short IO write bursts, but it would prevent
 low-priority writes from blocking high-priority read/writes.

You know, you can pretty easily control bandwidth utilization just using
cgroups.  This is what I do, and I get much better results with cgroups
and the deadline IO scheduler than I ever did with CFQ. Abstract
priorities are not bad for controlling relative CPU utilization, but
they really suck for IO scheduling.
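
As an illustration, a minimal sketch using the (v1) blkio controller and
the deadline scheduler; the group name, mount point, and the 8:16
major:minor numbers are placeholders:

echo deadline > /sys/block/sdb/queue/scheduler
mkdir /sys/fs/cgroup/blkio/maintenance
# cap writes from this group to ~20 MB/s on the device with major:minor 8:16
echo '8:16 20971520' > /sys/fs/cgroup/blkio/maintenance/blkio.throttle.write_bps_device
echo $$ > /sys/fs/cgroup/blkio/maintenance/tasks   # move this shell into the group
btrfs balance start -dusage=30 /mnt                # then start the balance from it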





Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 02:40, Russell Coker wrote:
 On Mon, 8 Sep 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 Also, I've found out the hard way that system chunks really should be
 RAID1, NOT RAID10, otherwise it's very likely that the filesystem
 won't mount at all if you lose 2 disks.
 
 Why would that be different?
 
 In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be 
 different?
That's still the case, but in a RAID1 with four disks, of the six
different pairs of two disks you could lose, only one will make the
filesystem un-mountable, whereas for a four disk RAID10, there are two
different pairs of two disks you could lose to make the filesystem
un-mountable.  I haven't run the numbers for higher numbers of disks,
but things are likely not better, because if you lose both copies of the
same stripe, things will fail.
 
 Also it would be nice if there was a N-way mirror option for system data.  As 
 such data is tiny (32MB on the 120G filesystem in my workstation) the space 
 used by having a copy on every disk in the array shouldn't matter.
 
N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
is ready, mkfs should default to one copy per disk in the filesystem.





Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 07:38, Hugo Mills wrote:
 On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
 On 2014-09-11 02:40, Russell Coker wrote:
 Also it would be nice if there was a N-way mirror option for system data.  
 As 
 such data is tiny (32MB on the 120G filesystem in my workstation) the space 
 used by having a copy on every disk in the array shouldn't matter.

 N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
 is ready, mkfs should default to one copy per disk in the filesystem.
 
Why change the default from 2-copies, which it's been for years?

Sorry about the ambiguity in my statement, I meant that the default for
system chunks should be one copy per disk in the filesystem.  If you
don't have a copy of the system chunks, then you essentially don't have
a filesystem, and that means that BTRFS RAID6 can't provide true
resilience against 2 disks failing catastrophically unless there are at
least 3 copies of the system chunks.





Problem with unmountable filesystem.

2014-09-16 Thread Austin S Hemmelgarn
So, I just recently had to hard reset a system running root on BTRFS,
and when it tried to come back up, it choked on the root filesystem.
Based on the kernel messages, the primary issue is log corruption, and
in theory btrfs-zero-log should fix it.  The actual issue however, is
that the primary superblock appears to be pointing at a corrupted root
tree, which causes pretty much everything that does anything other than
just read the sb to fail.  The first backup sb does point to a good
tree, but only btrfs check and btrfs restore have any option to ignore
the first sb and use one of the backups instead.  To make matters more
complicated, the first sb still has a valid checksum and passes the
tests done by btrfs rescue super-recover, and therefore that can't be
used to recover either.  I was wondering if anyone here might have any
advice.  I'm fine using dd to replace the primary sb with one of the
backups, but don't know the exact parameters that would be needed.
Also, we should consider adding a mount option to select a specific sb
mirror to use; I know that ext* have such an option, and that has
actually saved me a couple of times.  I'm using btrfs-progs 3.16 and
kernel 3.16.1.





Re: Problem with unmountable filesystem.

2014-09-17 Thread Austin S Hemmelgarn
On 2014-09-16 16:57, Chris Murphy wrote:
 
 On Sep 16, 2014, at 8:40 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 
 Based on the kernel messages, the primary issue is log corruption, and
 in theory btrfs-zero-log should fix it.
 
 Can you provide a complete dmesg somewhere for this initial failure, just for 
 reference? I'm curious what this indication looks like compared to other 
 problems.
 
Okay, I can't really get a 'complete' dmesg, because the system panics 
on the mount failure (the filesystem in question is the system's root 
filesystem), the system has no serial ports, and I didn't think to 
build in support for console on ttyUSB0.  I can however get what the 
recovery environment (locally compiled based on buildroot) shows when I 
try to mount the filesystem:
[   30.871036] BTRFS: device label gentoo devid 1 transid 160615 /dev/sda3
[   30.875225] BTRFS info (device sda3): disk space caching is enabled
[   30.917091] BTRFS: detected SSD devices, enabling SSD mode
[   30.920536] BTRFS: bad tree block start 0 130402254848
[   30.924018] BTRFS: bad tree block start 0 130402254848
[   30.926234] BTRFS: failed to read log tree
[   30.953055] BTRFS: open_ctree failed
  The actual issue however, is
 that the primary superblock appears to be pointing at a corrupted root
 tree, which causes pretty much everything that does anything other than
 just read the sb to fail.  The first backup sb does point to a good
 tree, but only btrfs check and btrfs restore have any option to ignore
 the first sb and use one of the backups instead.
 
 Maybe use wipefs -a on this volume, which removes the magic from only the 
 first superblock by default (you can specify another location). And then try 
 btrfs-show-super -F which dumps supers with bad magic.
 
Thanks for the suggestion, I hadn't thought of that...
 I just tried this:
 # wipefs -a /dev/sdb
 /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 
 5f 4d
 # btrfs-show-super -F /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x5c1196d7 [DON'T MATCH]
 bytenr65536
 flags 0x1
 magic  [DON'T MATCH]
 […]
 # btrfs-show-super -i1 /dev/sdb
 superblock: bytenr=67108864, device=/dev/sdb
 -
 csum  0xfc70be19 [match]
 bytenr67108864
 flags 0x1
 magic _BHRfS_M [match]
 
 So the mirror is definitely there and valid.
 # btrfs rescue super-recover -yv /dev/sdb
 No valid Btrfs found on /dev/sdb
 Usage or syntax errors
 
 Not expected at all, man page says Recover bad superblocks from good 
 copies. There's a good copy, it's not being found by btrfs rescue 
 super-recover. Seems like a bug.
 
 
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 
 # btrfs check -s1 /dev/sdb
 using SB copy 1, bytenr 67108864
 Checking filesystem on /dev/sdb
 UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
 [snip]
 OK it finds it, maybe a --repair will fix the bad first one?
 # btrfs check -s1 /dev/sdb
 using SB copy 1, bytenr 67108864
 enabling repair mode
 Checking filesystem on /dev/sdb
 UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
 [snip]
 No indication of repair
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 [root@f21v ~]# btrfs-show-super -F /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x5c1196d7 [DON'T MATCH]
 bytenr65536
 flags 0x1
 magic  [DON'T MATCH]
 
 
 Still not fixed. Maybe I needed to corrupt something else in the superblock 
 other than the magic and this behavior is intentional, otherwise wipefs -a, 
 followed by btrfsck would resurrect an intentionally wiped btrfs fs, 
 potentially wiping out some newer file system in the process.
 
...though maybe it's a good thing I didn't.
 
 
 I'm fine using dd to replace the primary sb with one of the
 backups, but don't know the exact parameters that would be needed.
 
 Here's an idea:
 
 # btrfs-show-super /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x92aa51ab [match]
 [snip]
 So I know what I'm looking for starts at LBA 65536/512
 
 # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C
   92 aa 51 ab 00 00 00 00  00 00 00 00 00 00 00 00  |..Q.............|
 [snip]
 
 And as it turns out the csum is right at the beginning, 4 bytes. So use bs of 
 4 bytes, seek 65536/4, count of 1. This should zero just 4 bytes starting at 
 65536 bytes in.
 
 # dd if=/dev/zero of=/dev/sdb bs=4 seek=16384 count=1
 
 Checked it with the earlier skip=128
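 For reference, re-running that earlier check (same device and offsets as 
 above) should now show zeros in the first four bytes where the csum used 
 to be, something along the lines of:
 
 # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C | head -n1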

Re: Problem with unmountable filesystem.

2014-09-18 Thread Austin S Hemmelgarn
On 09/17/2014 02:57 PM, Chris Murphy wrote:
 
 On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:

 Thanks for all the help.
 
 Well, it's not much help. It seems possible to corrupt a primary superblock 
 that points to a corrupt tree root, and use btrfs rescue super-recover to 
 replace it, and then mount should work. One thing I didn't try was corrupting 
 the primary superblock and just mounting normally or with recovery, to see if 
 it'll automatically ignore the primary superblock and use the backup.
 
 But I think you're onto something, that a good superblock can point to a 
 corrupt tree root, and then not have a straightforward way to mount the good 
 tree root. If I understand this correctly.
 
Corrupting the primary superblock did in fact work, and I decided to try
mounting immediately, which failed.  I didn't try with -o recovery, but
I think that would probably fail as well.  Things worked perfectly
however after using btrfs rescue super-recover.  As far as avoiding
future problems, I think the best solution would be to have the mount
operation try the tree root pointed to by the backup superblock if the
one pointed to by the primary seems corrupted.

Secondarily, this almost makes me want to set the ssd option on all
BTRFS filesystems, just to get the rotating superblock updates, because
if it weren't for that behavior, I probably wouldn't have been able to
recover anything in this particular case.


Re: Problem with unmountable filesystem.

2014-09-18 Thread Austin S Hemmelgarn
On 09/17/2014 04:22 PM, Duncan wrote:
 Austin S Hemmelgarn posted on Wed, 17 Sep 2014 07:23:46 -0400 as
 excerpted:
 
 I've also discovered, when trying to use btrfs restore to copy out the
 data to a different system, that 3.14.1 restore apparently chokes on
 filesystems that have lzo compression turned on.  It's reporting errors
 trying to inflate compressed files, and I know for a fact that none of
 those files were even open, let alone being written to, when the system
 crashed.  I don't know if this is a known bug or even if it is still the
 case with btrfs-progs 3.16, but I figured I'd comment about it because I
 haven't seen anything about it anywhere.
 
 FWIW that's a known and recently patched issue.  If you're still seeing 
 issues with it with btrfs-progs 3.16, report it, but 3.14.1 almost 
 certainly wouldn't have had the fix.  (This is one related patch turned 
 up by a quick search; there may be others.)
 
 * commit 93ebec96f2ae1d3276ebe89e2d6188f9b46692fb
 | Author: Vincent Stehlé vincent.ste...@laposte.net
 | Date:   Wed Jun 18 18:51:19 2014 +0200
 |
 | btrfs-progs: restore: check lzo compress length
 |
 | When things go wrong for lzo-compressed btrfs, feeding
 | lzo1x_decompress_safe() with corrupt data during restore
 | can lead to crashes. Reduce the risk by adding
 | a check on the input length.
 |
 | Signed-off-by: Vincent Stehlé vincent.ste...@laposte.net
 | Signed-off-by: David Sterba dste...@suse.cz
 |
 |  cmds-restore.c | 6 ++
 |  1 file changed, 6 insertions(+)
 
Yeah, 3.16 seems fine, I just hadn't updated my recovery environment
yet.  Ironically, I did some performance testing afterwards, and
realized that using any compression was actually slowing down my system
(my disk appears to be faster than my RAM, which is really sad, even for
a laptop).


Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:18, Rob Spanton wrote:

Hi,

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.

I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?

Here are the things that the wiki page asked for [1]:

uname -a:

 Linux zarniwoop.blob 3.16.2-200.fc20.x86_64 #1 SMP Mon Sep 8
 11:54:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version:

 Btrfs v3.16

btrfs fi show:

 Label: 'fedora'  uuid: 717c0a1b-815c-4e6a-86c0-60b921e84d75
Total devices 1 FS bytes used 1.49TiB
devid1 size 2.72TiB used 1.50TiB path /dev/sda4

 Btrfs v3.16

btrfs fi df /:

 Data, single: total=1.48TiB, used=1.48TiB
 System, DUP: total=32.00MiB, used=208.00KiB
 Metadata, DUP: total=11.50GiB, used=10.43GiB
 unknown, single: total=512.00MiB, used=0.00

dmesg dump is attached.

Please CC any responses to me, as I'm not subscribed to the list.

Cheers,

Rob

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list


WRT the performance of Evolution, the issue is probably fragmentation of 
the data files.  If you run the command:

# btrfs fi defrag -rv /home
you should see some improvement in evolution performance (until you get 
any new mail that is).  Evolution (like most graphical e-mail clients 
these days) uses sqlite for data storage, and sqlite database files are 
one of the known pathological cases for COW filesystems in general; the 
solution is to mark the files as NOCOW (see the info about VM images in 
[1] and [2], the same suggestions apply to database files).
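As a rough sketch of what that looks like for an existing data directory 
(the paths here are only an example, the mail client should be closed 
first, and note that chattr +C only affects files created after the flag 
is set on the directory):

$ mkdir ~/.local/share/evolution.nocow
$ chattr +C ~/.local/share/evolution.nocow
$ cp -a ~/.local/share/evolution/. ~/.local/share/evolution.nocow/
$ rm -rf ~/.local/share/evolution
$ mv ~/.local/share/evolution.nocow ~/.local/share/evolution

Copying the files into the NOCOW directory creates fresh files that 
inherit the attribute, which is what actually disables COW for them.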


As for git, I haven't seen any performance issues specific to BTRFS; are 
you using any compress= mount option? zlib based compression is known to 
cause serious slowdowns.  I don't think that git uses any kind of 
database for data storage.  Also, if the performance comparison is from 
other systems, unless those systems have the EXACT same hardware 
configuration, they aren't really a good comparison.  Unless the pc this 
is on is a relatively recent system (less than a year or two old), it 
may just be hardware that is the performance bottleneck.






Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:25, Swâmi Petaramesh wrote:

Le vendredi 19 septembre 2014, 13:18:34 Rob Spanton a écrit :

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.


Weeelll I have the same over-complicated kind of setup, and an Arch Linux
BTRFS system which used to boot in some decent amount of time in the past now
takes about 5 full minutes to just make it to the KDM login prompt, and
another 5 minutes before KDE is fully started. Makes me think of the good ole'
times of Windows 95 OSR2 on a 486SX with a dying 1 GB Hard disk...
Well, part of your problem might be KDE itself; it's extremely CPU 
intensive these days.  I'd suggest disabling the 'semantic desktop' 
stuff, because that tends to be the worst offender as far as soaking up 
system resources.  Also, if you recently switched to systemd, that may 
be causing some slowdown as well (journald's default settings are 
terrible for performance).
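For example (purely a sketch, with arbitrary values), journald can be 
told to keep its journal in RAM, or to cap the on-disk journal, in 
/etc/systemd/journald.conf:

[Journal]
Storage=volatile
#SystemMaxUse=100M
#Compress=yes

and then the service restarted:

# systemctl restart systemd-journald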


Now, let me add that I had removed all snaphots, ran a full defrag, and even
rebalanced the damn thing without any positive effect...

(And yes, my HD is physically in good shape, SMART feels fully happy, and it's
less than 75% full...)

I've been using BTRFS for 2-3 years on a dozen of different systems, and if
something doesn't surprise me at all, it's « slow performance », indeed,
although I'm myself more accustomed to « incredibly fscking damn slow
performance »...
It's kind of funny, but I haven't had any performance issues with BTRFS 
since about 3.10, even on the systems my employer is using Fedora 20 on, 
and those use only a Core 2 Duo Processor, DDR2-800 RAM, and SATA2 hard 
drives.

HTH








Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:49, Austin S Hemmelgarn wrote:

On 2014-09-19 08:18, Rob Spanton wrote:

Hi,

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.

I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?

Here are the things that the wiki page asked for [1]:

uname -a:

 Linux zarniwoop.blob 3.16.2-200.fc20.x86_64 #1 SMP Mon Sep 8
 11:54:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version:

 Btrfs v3.16

btrfs fi show:

 Label: 'fedora'  uuid: 717c0a1b-815c-4e6a-86c0-60b921e84d75
 Total devices 1 FS bytes used 1.49TiB
 devid1 size 2.72TiB used 1.50TiB path /dev/sda4

 Btrfs v3.16

btrfs fi df /:

 Data, single: total=1.48TiB, used=1.48TiB
 System, DUP: total=32.00MiB, used=208.00KiB
 Metadata, DUP: total=11.50GiB, used=10.43GiB
 unknown, single: total=512.00MiB, used=0.00

dmesg dump is attached.

Please CC any responses to me, as I'm not subscribed to the list.

Cheers,

Rob

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list



WRT the performance of Evolution, the issue is probably fragmentation of
the data files.  If you run the command:
# btrfs fi defrag -rv /home
you should see some improvement in evolution performance (until you get
any new mail that is).  Evolution (like most graphical e-mail clients
these days) uses sqlite for data storage, and sqlite database files are
one of the known pathological cases for COW filesystems in general; the
solution is to mark the files as NOCOW (see the info about VM images in
[1] and [2], the same suggestions apply to database files).

As for git, I haven't seen any performance issues specific to BTRFS; are
you using any compress= mount option? zlib based compression is known to
cause serious slowdowns.  I don't think that git uses any kind of
database for data storage.  Also, if the performance comparison is from
other systems, unless those systems have the EXACT same hardware
configuration, they aren't really a good comparison.  Unless the pc this
is on is a relatively recent system (less than a year or two old), it
may just be hardware that is the performance bottleneck.


Realized after I sent this that I forgot the links for [1] and [2]

[1] https://btrfs.wiki.kernel.org/index.php/UseCases
[2] https://btrfs.wiki.kernel.org/index.php/FAQ





Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 09:51, Holger Hoffstätte wrote:


On Fri, 19 Sep 2014 13:18:34 +0100, Rob Spanton wrote:


I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.


This is - unfortunately - a particular btrfs oddity/characteristic/flaw,
whatever you want to call it. git relies a lot on fast stat() calls,
and those seem to be particularly slow with btrfs esp. on rotational
media. I have the same problem with rsync on a freshly mounted volume;
it gets fast (quite so!) after the first run.
I find that kind of funny, because regardless of filesystem, stat() is 
one of the *slowest* syscalls on almost every *nix system in existence.


The simplest thing to fix this is a du -s >/dev/null to pre-cache all
file inodes.

I'd also love a technical explanation why this happens and how it could
be fixed. Maybe it's just a consequence of how the metadata tree(s)
are laid out on disk.
While I don't know for certain, I think it's largely just a side effect 
of the lack of performance tuning in the BTRFS code.



I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?


Don't drop the caches :-)

-h








Re: Problem with unmountable filesystem.

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 13:07, Chris Murphy wrote:

Possibly btrfs-select-super can do some of the things I was doing the hard way. It's 
possible to select a super to overwrite other supers, even if they're good 
ones. Whereas btrfs rescue super-recover won't do that, and neither will btrfsck, hence 
why I corrupted the one I didn't want first. This command isn't built by default (at 
least not on Fedora).
I don't think it's built by default on any of the major distributions. 
On Gentoo you need to set package specific configure options.







Re: Problem with unmountable filesystem.

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 13:54, Chris Murphy wrote:


On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:

[   30.920536] BTRFS: bad tree block start 0 130402254848
[   30.924018] BTRFS: bad tree block start 0 130402254848
[   30.926234] BTRFS: failed to read log tree
[   30.953055] BTRFS: open_ctree failed

I'm still confused. Btrfs knows this tree root is bad, but it has backup roots. 
So why wasn't one of those used by -o recovery? I thought that's the whole 
point of that mount option. Backup tree roots are per superblock, so 
conceivably you'd have up to 8 of these with two superblocks, they're shown with
btrfs-show-super -af  ## and -F even if a super is bad

But skipping that, to fix this you need to know which super is pointing to the 
wrong tree root, since you're using ssd mount option with rotating supers. I 
assume mount uses the super with the highest generation number. So you'd need 
to:
btrfs-show-super -a
to find out the super with the most recent generation. You'd assume that one 
was wrong. And then use btrfs-select-super to pick the right one, and replace 
the wrong one. Then you could mount.

I also wonder if btrfs check -sX would show different results in your case. I'd 
think it would because it ought to know one of those tree roots is bad, seeing 
as mount knows it. And then it seems (I'm speculating a ton) that --repair 
might try to fix the bad tree root, and then if it fails I'd like to think it 
can just find the most recent good tree root, ideally one listed as a 
backup_tree_root by any good superblock, and then have the next mount use that.

I'm not sure why this persistently fails, and I wonder if there are cases of 
users giving up and blowing away file systems that could actually be mountable. 
But it's just really a manual process figuring out what things to do in what 
order to get them to mount.

From what I can tell, btrfs check doesn't do anything about backup 
superblocks unless you specifically tell it to.  In this case, running 
btrfs check without specifying a superblock mirror, and with explicitly 
specifying the primary superblock produced identical results (namely it 
 choked, hard, with an error message similar to that from the kernel). 
However, running it with -s1 to select the first backup superblock 
returned no errors at all other than the space_cache being invalid and 
the count of used blocks being wrong.


Based on my (limited) understanding of the mount code, it does try to 
use the superblock with the highest generation (regardless of whether we 
are on an ssd or not), but doesn't properly fall back to a secondary 
superblock after trying to mount using the primary.


As far as btrfs check repair trying to fix this, I don't think that it 
does so currently, probably for the same reason that mount fails.







Re: Single disk parrallelization

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 14:10, Jeb Thomson wrote:

With the advanced features of btrfs, it would be an additional simple task to 
make different platters run in parallel.

In this case, say a disk has three platters, and so three seek heads as well. 
If we can identify that much, and what offsets they are at, it then becomes a 
trivial matter to place the reads and writes to different platters at the same 
time.

In effect, this means each platter should be operating as a single virtualized 
unit, instead of the whole drive acting as one single unit...


In theory this is a great idea except for two things:
1) Most consumer drives have only one platter.
2) The kernel doesn't have such low-level hardware access, so it would 
have to be implemented in device firmware (and I'd be willing to bet 
that most drive manufacturers already stripe data across multiple 
platters when possible).







Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-22 16:51, Stefan G. Weichinger wrote:

Am 20.09.2014 um 11:32 schrieb Duncan:


What I do as part of my regular backup regime, is every few kernel cycles
I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new
optional features as I believe appropriate.  Then I boot to the new
backup and run a bit to test it, then wipe the normal working copy and do
a fresh mkfs.btrfs on it, again with the new optional features enabled
that I want.


Is re-creating btrfs-filesystems *recommended* in any way?

Does that actually make a difference in the fs-structure?

I would recommend it; there are some newer features that you can only 
set at mkfs time.  Quite often, when a new feature is implemented, it 
takes some time before it can be enabled online, and even then nothing 
gets converted until it is rewritten.

So far I assumed it was enough to keep the kernel up2date, use current
(stable) btrfs-progs and run some scrub every week or so (not to mention
backups .. if it ain't backed up, it was/isn't important).

Stefan









Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-23 09:06, Stefan G. Weichinger wrote:

Am 23.09.2014 um 14:08 schrieb Austin S Hemmelgarn:

On 2014-09-22 16:51, Stefan G. Weichinger wrote:

Is re-creating btrfs-filesystems *recommended* in any way?

Does that actually make a difference in the fs-structure?


I would recommend it, there are some newer features that you can only
set at mkfs time.  Quite often, when a new feature is implemented, it is
some time before things are such that it can be enabled online, and even
then that doesn't convert anything until it is rewritten.


What features for example?
Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the 
following list of features:

mixed-bg- mixed data and metadata block groups
extref  - increased hard-link limit per file to 65536
raid56  - raid56 extended format
skinny-metadata - reduced size metadata extent refs
no-holes- no explicit hole extents for files

mixed-bg is something that you generally wouldn't want to change after mkfs.
extref can be enabled online, and the filesystem metadata gets updated 
as needed; it doesn't provide any real performance improvement (but is 
needed for some mail servers that have HUGE mail-queues).
I don't know anything about the raid56 option, but there isn't any way 
to change it after mkfs.
skinny-metadata can be changed online, and the format gets updated on 
rewrite of each metadata block.  This one does provide a performance 
improvement (stat() in particular runs noticeably faster).  You should 
probably enable this if it isn't already enabled, even if you don't 
recreate your filesystem.
no-holes cannot currently be changed online, and is a very recent 
addition (post v3.14 btrfs-progs I believe) that provides improved 
performance for sparse files (which is particularly useful if you are 
doing things with fixed size virtual machine disk images).


It's this last one that prompted me personally to recreate my 
filesystems most recently, as I use sparse files to save space as much 
as possible.


I created my main btrfs a few months ago and would like to avoid
recreating it as this would mean restoring my root-fs on my main
workstation.

Although I would do it if it is worth it ;-)

I assume I could read some kind of version number out of the superblock
or so?

btrfs-show-super ?

AFAIK there isn't really any 'version number' that has any meaning in 
the superblock (except for telling the kernel that it uses the stable 
disk layout), however, there are flag bits that you can look for 
(compat_flags, compat_ro_flags, and incompat_flags).  I'm not 100% 
certain what each bit means, but on my system with a only 1 month old 
BTRFS filesystem, with extref, skinny-metadata, and no-holes turned on, 
I have compat_flags: 0x0, compat_ro_flags: 0x0, and incompat_flags: 0x16b.
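If you'd rather not decode the raw flag words by hand, the relevant 
commands are simple enough (a sketch; /dev/sdX is a placeholder):

# btrfs-show-super /dev/sdX | grep flags    # dump the compat/incompat flag words
# mkfs.btrfs -O list-all                    # list the features this progs build knows about
# mkfs.btrfs -O extref,skinny-metadata,no-holes /dev/sdX    # enable them at mkfs time

The last command of course wipes the device, so it only applies when 
recreating the filesystem anyway.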


The other potentially significant thing is that the default 
nodesize/leafsize has changed recently from 4096 to 16384, as that gives 
somewhat better performance for most use cases.







Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-23 10:23, Tobias Holst wrote:

If it is unknown, which of these options have been used at btrfs
creation time - is it possible to check the state of these options
afterwards on a mounted or unmounted filesystem?


2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn ahferro...@gmail.com
mailto:ahferro...@gmail.com:

Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives
the following list of features:
mixed-bg- mixed data and metadata block groups
extref  - increased hard-link limit per file to 65536
raid56  - raid56 extended format
skinny-metadata - reduced size metadata extent refs
no-holes- no explicit hole extents for files

I don't think there is a specific tool for doing this, but some of them 
do show up in dmesg, for example skinny-metadata shows up as a mention 
of the FS having skinny extents.
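So the quickest check is just to mount the filesystem and look at the 
kernel log (a sketch; the exact message wording varies a bit between 
kernel versions):

# mount /dev/sdX /mnt
# dmesg | grep -i btrfs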






Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-08 15:11, Eric Sandeen wrote:

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
Errors are corrected along if possible (what *is* possible?)
* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.
* mount -o degraded
Allow mounts to continue with missing devices.
(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
remove the log tree if log tree is corrupt
* btrfs rescue
Recover a damaged btrfs filesystem
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
repair a btrfs filesystem
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?
* btrfs restore
try to salvage files from a damaged filesystem
(not really repair, it's disk-scraping)


What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
same errors, only online vs. offline?  If not, what class of errors does one 
fix vs.
the other?  How would an admin know?  Can btrfs check recover a bad tree root
in the same way that mount -o recovery does?  How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?


Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a 
RAID volume; that is, it fixes disparity between multiple copies of the 
same block.  IOW, it isn't really repair per se, but more preventative 
maintenance.  Currently, it only works for cases where you have multiple 
copies of a block (dup, raid1, and raid10 profiles), but support is 
planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it is more for 
dealing with metadata-related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage 
profile with fewer devices than the profile minimum.  It's primarily so 
that you can get the fs into a state where you can run 'btrfs device 
replace'
* btrfs-zero-log only deals with log tree corruption.  This would be 
roughly equivalent to zeroing out the journal on an XFS or ext4 
filesystem, and should almost never be needed.
* btrfs rescue is intended for low-level recovery of corruption on an 
offline fs.
* chunk-recover I'm not entirely sure about, but I believe it's 
like scrub for a single chunk on an offline fs
* super-recover is for dealing with corrupted superblocks, and 
tries to replace it with one of the other copies (which hopefully isn't 
corrupted)
* btrfs check is intended to (eventually) be equivalent to the fsck 
utility for most other filesystems.  Currently, it's relatively good at 
identifying corruption, but less so at actually fixing it.  There are 
however, some things that it won't catch, like a superblock pointing to 
a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in 
knowledge of the filesystem's on-disk structure, which makes it more 
reliable than more generic tools like scalpel for files that are too big 
to fit in the metadata blocks, and it is pretty much essential for 
dealing with transparently compressed files.


In general, my personal procedure for handling a misbehaving BTRFS 
filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify 
what's wrong

* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is 
corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue 
super-recover

* If all of the above fails, ask for advice on the mailing list or IRC
Also, you should be running btrfs scrub regularly to correct bit-rot and 
force remapping of blocks with read errors.  While BTRFS technically 
handles both transparently on reads, it only corrects things on disk when 
you do a scrub.
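A reasonable way to do that is a periodic foreground scrub (a sketch 
only; the schedule and mount point are arbitrary):

# btrfs scrub start -Bd /    # run in the foreground, report per-device stats
# btrfs scrub status /       # check on a scrub started in the background

or, as an /etc/crontab entry:

0 3 * * 0  root  btrfs scrub start -Bdq /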






Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on 
disk when it detects an error during a read.  I know it doesn't if the fs 
is mounted ro (even if the media is writable), because I did some 
testing to see how 'read-only' mounting a btrfs filesystem really is.


Also, that's a much better description of how multiple copies work than 
I could probably have ever given.







Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 08:12, Hugo Mills wrote:

On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on disk
when it detects an error during a read,


I'm fairly sure it does, as I've had it happen to me. :)
I probably just misinterpreted the source code, while I know enough C to 
generally understand things, I'm by far no expert.



I know it doesn't it the fs is
mounted ro (even if the media is writable), because I did some testing to
see how 'read-only' mounting a btrfs filesystem really is.


If the FS is RO, then yes, it won't fix things.

Hugo.








Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 08:34, Duncan wrote:

On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn ahferro...@gmail.com wrote:


On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct
bit-rot and force remapping of blocks with read errors.  While
BTRFS technically handles both transparently on reads, it only
corrects thing on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50%
chance of btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads
the bad one, it checks the other one and assuming it's good,
replaces the bad one with the good one both for the read (which
otherwise errors out), and by overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading
it for some reason, but even then, chances are 50% it'll pick the
good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it
with the good copy, scrub is the only way to systematically detect
and (if there's a good copy) fix these checksum errors.  It's not
that btrfs doesn't do it if it finds them, it's that the chances of
finding them are relatively low, unless you do a scrub, which
systematically checks the entire filesystem (well, other than files
marked nocsum, or nocow, which implies nocsum, or files written
when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work.  I guess it's possible
that btrfs isn't doing those routine bump-into-it-and-fix-it
fixes yet, but if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on
disk when it detects an error during a read; I know it doesn't if the
fs is mounted ro (even if the media is writable), because I did some
testing to see how 'read-only' mounting a btrfs filesystem really is.


Definitely it won't with a read-only mount.  But then scrub shouldn't
be able to write to a read-only mount either.  The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes

All in all, a BTRFS filesystem mounted ro is much more read-only than 
say ext4 (which at least updates the sb, and old versions replayed the 
journal, in addition to the atime updates).
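A simple way to run that sort of test (just a sketch, using a scratch 
device; whether this matches the exact method used for the testing 
mentioned above is an assumption) is to mark the block device itself 
read-only and then watch dmesg for failed writes during and after the 
ro mount:

# blockdev --setro /dev/sdX
# mount -o ro /dev/sdX /mnt
# dmesg | tail
# umount /mnt; blockdev --setrw /dev/sdX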


There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache dirty at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know.  If not, I'd call it
a bug.  The problem is in the detection, not in the rewriting.  Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.


Also, that's a much better description of how multiple copies work
than I could probably have ever given.


Thanks.  =:^)








Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Austin S Hemmelgarn

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as the
default tree root, and bypass the bad tree root if it cannot be
repaired, so that the volume can be mounted with default mount options
(i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
suited for general purpose use as rootfs let alone for boot.



A filesystem which is suited for general purpose use is a filesystem
which honors fsync, and doesn't *ever* auto-roll-back without user
intervention.

Anything different is not suited for database transactions at all. Any
paid service which has the users database on btrfs is going to be at
risk of losing payments, and probably without the company even knowing.
If btrfs goes this way I hope a big warning is written on the wiki and
on the manpages telling that this filesystem is totally unsuitable for
hosting databases performing transactions.
If they need reliability, they should have some form of redundancy 
in place and/or run the database directly on the block device, because 
even ext4, XFS, and pretty much every other filesystem can lose data 
sometimes.  The difference is that those tend to give worse results 
than BTRFS when hardware is misbehaving, because BTRFS usually has 
an old copy of whatever data structure gets corrupted to fall back on.


Also, you really shouldn't be running databases on a BTRFS filesystem at 
the moment anyway, because of the significant performance implications.


At most I can suggest that a flag in the metadata be added to
allow/disallow auto-roll-back-on-error on such filesystem, so people can
decide the tolerant vs. transaction-safe mode at filesystem creation.



The problem with this is that if the auto-recovery code did run (and 
IMHO the kernel should spit out a warning to the system log whenever it 
does), then chances are that you wouldn't have had a consistent view if 
you had prevented it from running either; and, if the database is 
properly distributed/replicated, then it should recover by itself.







Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-10 18:05, Eric Sandeen wrote:

On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as
the default tree root, and bypass the bad tree root if it cannot
be repaired, so that the volume can be mounted with default mount
options (i.e. the ones in fstab). Otherwise it's a filesystem
that isn't well suited for general purpose use as rootfs let
alone for boot.



A filesystem which is suited for general purpose use is a
filesystem which honors fsync, and doesn't *ever* auto-roll-back
without user intervention.

Anything different is not suited for database transactions at all.
Any paid service which has the users database on btrfs is going to
be at risk of losing payments, and probably without the company
even knowing. If btrfs goes this way I hope a big warning is
written on the wiki and on the manpages telling that this
filesystem is totally unsuitable for hosting databases performing
transactions.

If they need reliability, they should have some form of redundancy
in-place and/or run the database directly on the block device;
because even ext4, XFS, and pretty much every other filesystem can
lose data sometimes,


Not if i.e. fsync returns.  If the data is gone later, it's a hardware
problem, or occasionally a bug - bugs that are usually found  fixed
pretty quickly.

Yes, barring bugs and hardware problems they won't lose data.



the difference being that those tend to give
worse results when hardware is misbehaving than BTRFS does, because
BTRFS usually has a old copy of whatever data structure gets
corrupted to fall back on.


I'm curious, is that based on conjecture or real-world testing?

I wouldn't really call it testing, but based on personal experience I 
know that ext4 can lose whole directory sub-trees if it gets a single 
corrupt sector in the wrong place.  I've also had that happen on FAT32 
and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and 
I've actually had individual files disappear on HFS+ without any 
discernible hardware issues.  I don't have as much experience with XFS, 
but would assume based on what I do know of it that it could have 
similar issues.  As for BTRFS, I've only ever had any issues with it 3 
times, one was due to the kernel panicking during resume from S1, and 
the other two were due to hardware problems that would have caused 
issues on most other filesystems as well.  In both cases of hardware 
issues, while the filesystem was initially unmountable, it was 
relatively simple to fix once I knew how.  I tried to fix an ext4 fs 
that had become unmountable due to dropped writes once, and that was 
anything but simple, even with the much greater amount of documentation.






Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-12 06:14, Martin Steigerwald wrote:

Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery

Enable autorecovery attempts if a bad tree root is found at mount
time.


I'm confused why it's not the default yet. Maybe it's continuing to
evolve at a pace that suggests something could sneak in that makes
things worse? It is almost an oxymoron in that I'm manually enabling an
autorecovery

If true, maybe the closest indication we'd get of btrfs stablity is the
default enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think at distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one side,
than btrfs had corruption. When I remount it, definitely the worst thing
that can happen is that it auto-rolls-back to a previous known-good
state.

For a general purpose file system, losing 30 seconds (or less) of
questionably committed data, likely corrupt, is a file system that won't
mount without user intervention, which requires a secret decoder ring to
get it to mount at all. And may require the use of specialized tools to
retrieve that data in any case.

The fail safe behavior is to treat the known good tree root as the default
tree root, and bypass the bad tree root if it cannot be repaired, so that
the volume can be mounted with default mount options (i.e. the ones in
fstab). Otherwise it's a filesystem that isn't well suited for general
purpose use as rootfs let alone for boot.


To understand this a bit better:

What can be the reasons a recent tree gets corrupted?


Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just 
happened to be in the middle of a tree commit.

2. Generic power loss during a tree commit.
3. A device not properly honoring write-barriers (the operations 
immediately adjacent to the write barrier weren't being ordered 
correctly all the time).


Based on what I know about BTRFS, the following could also cause problems:
1. A single-event-upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this 
happen to me, but know people who have).


In general, any of these will cause problems for pretty much any 
filesystem, not just BTRFS.

I always thought with a controller and device and driver combination that
honors fsync with BTRFS it would either be the new state of the last known
good state *anyway*. So where does the need to rollback arise from?

I think that in this case the term rollback is a bit ambiguous; here it 
means a rollback from the point of view of userspace, which sees the FS 
as having 'rolled back' from the most recent state to the last known 
good state.

That said all journalling filesystems have some sort of rollback as far as I
understand: If the last journal entry is incomplete they discard it on journal
replay. So even there you use the last seconds of write activity.

But in case fsync() returns the data needs to be safe on disk. I always
thought BTRFS honors this under *any* circumstance. If some proposed
autorollback breaks this guarentee, I think something is broke elsewhere.

And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
is nothing, absolutely nothing to discuss about it.

An fsync completes if the device itself reported Yeah, I have the data on
disk, all safe and cool to go. Anything else is a bug IMO.

Or a hardware issue: most filesystems need disks to properly honor write 
barriers to provide guaranteed semantics on an fsync, and many consumer 
disk drives still don't honor them consistently.






Re: Wishlist Item :: One Subvol in Multiple Places

2014-10-15 Thread Austin S Hemmelgarn

On 2014-10-14 18:25, Robert White wrote:

I've got no idea if this is possible given the current storage layout,
but it would be Really Nice™ if there were a way to have a single
subvolume exist in more than one place in hirearchy. I know this can be
faked via mount tricks (bind or use of subvol=), but having it be a real
thing would be preferable.

For example, if I have two or more distributions on a computer or want
to switch between 32bit and 64bit environments frequently, but I want to
use the same /home (which is its own subvolume anyway) it would be nice
if the native layout could be permuted such that /__System_32/home and
/__System_64/home were the actual same subvolume.

The mechanism, were it possible, would be something like btrfs
subvolume link /existing/path /new/path (or bind instead of link)

I've got no idea if the directory structure would allow for this, but if
it would it would simplify several things (for me anyway) if the file
system layout represented the runtime layout.
This probably won't be implemented, for the same reason that most modern 
unix systems disallow hardlinks to directories; namely, it results in 
ambiguity regarding resolution of the .. directory entry.
The better solution would be to put /home in a separate top-level 
sub-volume, and then mount that in each location.
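With a top-level subvolume (here assumed to be named 'home'; the UUID is 
a placeholder) the fstab entries would look roughly like:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /__System_32/home  btrfs  subvol=home  0 0
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /__System_64/home  btrfs  subvol=home  0 0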






Re: strange 3.16.3 problem

2014-10-20 Thread Austin S Hemmelgarn

On 2014-10-20 09:02, Zygo Blaxell wrote:

On Mon, Oct 20, 2014 at 04:38:28AM +, Duncan wrote:

Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted:


# find . -name *546
./1412233213.M638209P10546 # ls -l ./1412233213.M638209P10546 ls: cannot
access ./1412233213.M638209P10546: No such file or directory


Does your mail server do a lot of renames?  Is one perhaps stuck?  If so,
that sounds like the same thing Zygo Blaxell is reporting in the
3.16.3..3.17.1 hang in renameat2() thread, OP on Sun, 19 Oct 2014
15:25:26 -400, Msg-ID: 20141019192525.ga29...@hungrycats.org, as linked
here:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/39539

I pointed him at this thread too.  I hadn't seen you mention a hung
rename, but the other symptoms sound similar.


Not really.  It looks like Russell having a NFS client-side problem,
I'm having a server-side one (maybe).  Also, all Russell's system calls
seem to be returning promptly, while some of mine are not.  Even if
there were timeouts, an NFS server timeout gives a different error than
'No such file or directory'.  Finally, the one and only thing I _can_
do with my bug is 'ls' on the renamed files (for me, the find would get
stuck before returning any output).

For Russell's issue...most of the stuff I can think of has been
tried already.  I didn't see if there was any attempt try to ls the
file from the NFS server as well as the client side.  If ls is OK on
the server but not the client, it's an NFS issue (possibly interacting
with some btrfs-specific quirk); otherwise, it's likely a corrupted
filesystem (mail servers seem to be unusually good at making these).

Most of the I/O time on mail servers tends to land in the fsync() system
call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after
3.16, and not in the 3.16.x stable update for x = 5 (the last one
I've checked)).  That said, I'm not familiar with how fsync() translates
over NFS, so it might not be relevant after all.

If the NFS server's view of the filesystem is OK, check the NFS protocol
version from /proc/mounts on the client.  Sometimes NFS clients will
get some transient network error during connection and fall back to some
earlier (and potentially buggier) NFS version.  I've seen very different
behavior in some important corner cases from v4 and v3 clients, for
example, and if the client is falling all the way back to v2 the bugs
and their workarounds start to get just plain _weird_ (e.g. filenames
which produce specific values from some hash function or that contain
specific character sequences are unusable).  v2 is so old it may even
have issues with 64-bit inode numbers.

Just now saw this thread, but IIRC 'No such file or directory' also gets 
returned sometimes when trying to automount a share that can't be 
enumerated by the client, and also sometimes when there is a stale NFS 
file handle.






Re: Poll: time to switch skinny-metadata on by default?

2014-10-21 Thread Austin S Hemmelgarn

On 2014-10-21 05:29, Duncan wrote:

David Sterba posted on Mon, 20 Oct 2014 18:34:03 +0200 as excerpted:


On Thu, Oct 16, 2014 at 01:33:37PM +0200, David Sterba wrote:

I'd like to make it default with the 3.17 release of btrfs-progs.
Please let me know if you have objections.


For the record, 3.17 will not change the defaults. The timing of the
poll was very bad to get enough feedback before the release. Let's keep
it open for now.


FWIW my own results agree with yours, I've had no problem with skinny-
metadata here, and it has been my default now for a couple backup-and-new-
mkfs.btrfs generations, now.

As you know there were some problems with it in the first kernel cycle or
two after it was introduced as an option, and I waited awhile until they
died down before trying it here, but as I said, no problems since I
switched it on, and I've been running it awhile now.

So defaulting to skinny-metadata looks good from here. =:^)

Same here, I've been using it on all my systems since I switched from 
3.15 to 3.16, and have had no issues whatsoever.






Re: downgrade from kernel 3.17 to 3.10

2014-10-21 Thread Austin S Hemmelgarn

On 2014-10-21 11:34, Cristian Falcas wrote:

I will start investigating how can we build our own rpms from the 3.16
sources. Until then we are stuck with the ones from the official repos
or elrepo. Which means 3.10 is the latest for el6. We used this until
now and seems we where lucky enough to not hit anything bad.

IIRC there is a make target in the kernel sources that generates the 
appropriate RPMs for you, although building from mainline won't get you 
any of the patches from Oracle that they use in EL.
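Something along these lines should work from an unpacked 3.16.x source 
tree (a sketch; the config step just reuses the running kernel's 
configuration):

$ cp /boot/config-$(uname -r) .config
$ make olddefconfig
$ make -j$(nproc) binrpm-pkg    # or 'rpm-pkg' for source plus binary RPMs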

We upgraded to 3.17 because we use ceph on the machine with openstack
and on the ceph site they recommended 3.14. And because we need
writable snapshots, we are forced to use btrfs under ceph.

Thank you all for your advice.




On Tue, Oct 21, 2014 at 6:20 PM, Robert White rwh...@pobox.com wrote:

On 10/21/2014 06:18 AM, Cristian Falcas wrote:


Thank you for your answer.

I will reformat the disk with a 3.10 kernel in the meantime, because I
don't have any rpms for 3.16 now.



More concisely: Don't use 3.10 BTRFS for data you value. There is a
non-trivial chance that the problems you observed are/were due to bad
things on the disk written there by 3.10.

There is no value to recreating your file systems under 3.10 as the same
thing is likely to go bad again when you get out of the dungeon.

What are your RPM options? What about just getting the sources from
kernel.org and compiling your own 3.16.5?

Seriously, 3.10 just... no...

8-)








Re: device balance times

2014-10-22 Thread Austin S Hemmelgarn

On 2014-10-21 16:44, Arnaud Kapp wrote:

Hello,

I would like to ask if the balance time is related to the number of
snapshot or if this is related only to data (or both).

I currently have about 4TB of data and around 5k snapshots. I'm thinking
of going raid1 instead of single. From the numbers I see this seems
totally impossible as it would take *way* too long.

Would destroying snapshots (those are hourly snapshots to prevent stupid
error to happens, like `rm my_important_file`) help?

Should I reconsider moving to raid1 because of the time it would take?

Sorry if I'm somehow hijacking this thread, but it seemed related :)

Thanks,

The issue is the snapshots.  I regularly fully re-balance my home 
directory on my desktop, which is ~150GB on a BTRFS raid10 setup with 
only 3 or 4 snapshots (I only do daily snapshots, because anything I need 
finer granularity on I have under git), and that takes only about 2 or 3 
hours depending on how many empty chunks I have.


I would remove the snapshots, and also start keeping fewer of them (5k 
hourly snapshots is more than six months worth of file versions), and 
then run the balance.  I would also suggest converting data by itself 
first, and then converting metadata, as converting data chunks will 
require re-writing large parts of the metadata.
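The conversion itself is just two balance runs with convert filters 
(a sketch; the mount point and the added device are placeholders, and 
the second device has to be added before converting to raid1):

# btrfs device add /dev/sdY /mnt
# btrfs balance start -dconvert=raid1 /mnt
# btrfs balance start -mconvert=raid1 /mnt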

On 10/21/2014 10:14 PM, Piotr Pawłow wrote:

On 21.10.2014 20:59, Tomasz Chmielewski wrote:

FYI - after a failed disk and replacing it I've run a balance; it took
almost 3 weeks to complete, for 120 GBs of data:


Looks normal to me. Last time I started a balance after adding 6th
device to my FS, it took 4 days to move 25GBs of data. Some chunks took
20 hours to move. I currently have 156 snapshots on this FS (nightly
rsync backups).

I think it is so slow, because it's disassembling chunks piece by piece
and stuffing these pieces elsewhere, instead of moving chunks as a
whole. If you have a lot of little pieces (as I do), it will take a
while...









Re: 5 _thousand_ snapshots? even 160?

2014-10-22 Thread Austin S Hemmelgarn

On 2014-10-21 21:10, Robert White wrote:


I don't think balance will _ever_ move the contents of a read only
snapshot. I could be wrong. I think you just end up with an endlessly
fragmented storage space and balance has to take each chunk and search
for someplace else it might better fit. Which explains why it took so long.

And just _forget_ single-extent large files at that point.

(Of course I could be wrong about the never-move rule, but moving would
then require the checksums on the potentially hundreds or thousands of
references to be recalculated, which would make incremental send/receive
unfathomable.)

Balance doesn't do anything different for snapshots from what it does 
with regular data.  I think you are confusing balance with 
defragmentation, as that does (in theory) handle snapshots differently. 
Balance just takes all of the blocks selected by the filters, sends them 
through the block allocator again, and then updates the metadata to 
point to the new blocks.  It can result in some fragmentation, but 
usually only for files bigger than about 256M, and even then it doesn't 
always cause fragmentation.
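
For example, a filtered balance that only rewrites chunks which are at
most half full (mount point and threshold chosen arbitrarily):

    # rewrite only data/metadata chunks that are <=50% used, packing them together
    btrfs balance start -dusage=50 -musage=50 /mnt/pool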


On 10/21/2014 01:44 PM, Arnaud Kapp wrote:

Hello,

I would like to ask if the balance time is related to the number of
snapshots, or only to the amount of data (or both).

I currently have about 4TB of data and around 5k snapshots. I'm thinking
of going raid1 instead of single. From the numbers I see this seems
totally impossible as it would take *way* too long.

Would destroying snapshots (those are hourly snapshots to prevent stupid
errors from happening, like `rm my_important_file`) help?

Should I reconsider moving to raid1 because of the time it would take?

Sorry if I'm somehow hijacking this thread, but it seemed related :)

Thanks,

On 10/21/2014 10:14 PM, Piotr Pawłow wrote:

On 21.10.2014 20:59, Tomasz Chmielewski wrote:

FYI - after a failed disk and replacing it I've run a balance; it took
almost 3 weeks to complete, for 120 GBs of data:


Looks normal to me. Last time I started a balance after adding a 6th
device to my FS, it took 4 days to move 25GB of data. Some chunks took
20 hours to move. I currently have 156 snapshots on this FS (nightly
rsync backups).

I think it is so slow, because it's disassembling chunks piece by piece
and stuffing these pieces elsewhere, instead of moving chunks as a
whole. If you have a lot of little pieces (as I do), it will take a
while...








Re: NOCOW and Swap Files?

2014-10-23 Thread Austin S Hemmelgarn

On 2014-10-22 16:08, Robert White wrote:

So the documentation is clear that you can't mount a swap file through
BTRFS (unless you use a loop device).

Why is a NOCOW file that has been fully pre-allocated -- as with
fallocate(1) -- not suitable for swapping?

I found one reference to an unimplemented feature necessary for swap,
but wouldn't it be reasonable for that feature to exist for NOCOW files?
(or does this relate to my previous questions about the COW operation
that happens after a snapshot?)

I actually use a swapfile on BTRFS on a regular basis on my laptop 
(trying to keep the number of partitions to a minimum, because I 
dual-boot Windows), and here's what the init script I use for it does:
1. Remove any old swap file (the fs is on an SSD, so I do this mostly to 
get the discard operation).
2. Use touch to create a new file.
3. Use chattr to mark the file NOCOW.
4. Use fallocate to pre-allocate the space for the file.
5. Bind the file to a loop device.
6. Format as swap and add as swapspace.

This works very reliably for me; the overhead of the loop device is 
relatively insignificant for my use case (because my disk is actually 
faster than my RAM), and I can safely balance/defrag/fstrim the 
filesystem without causing issues with the swap file.
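
For reference, a minimal sketch of those steps (the path, size and loop
handling are examples, not the exact script I use):

    SWAPFILE=/var/swap/swapfile                  # example path
    rm -f "$SWAPFILE"                            # 1. drop the old file
    touch "$SWAPFILE"                            # 2. create a new, empty file
    chattr +C "$SWAPFILE"                        # 3. mark it NOCOW while still empty
    fallocate -l 8G "$SWAPFILE"                  # 4. pre-allocate the space (example size)
    LOOP=$(losetup -f --show "$SWAPFILE")        # 5. bind it to a free loop device
    mkswap "$LOOP" && swapon "$LOOP"             # 6. format and enable as swap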


If you can avoid using a swapfile though, I would suggest doing so, 
regardless of which FS you are using.  I actually use a 4-disk RAID-0 
LVM volume on my desktop, and it gets noticeably better performance than 
using a swap file.
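
For reference, a striped swap LV can be set up roughly like this (VG
name, size and stripe count are made up):

    lvcreate -n swap -L 16G -i 4 vg0   # 4-way striped LV across the VG's PVs
    mkswap /dev/vg0/swap
    swapon /dev/vg0/swap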






Re: device balance times

2014-10-23 Thread Austin S Hemmelgarn

On 2014-10-23 05:19, Miao Xie wrote:

On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:

On 22.10.2014 03:43, Chris Murphy wrote:

On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote:

Looks normal to me. Last time I started a balance after adding a 6th device to my
FS, it took 4 days to move 25GB of data.

It's long-term untenable. At some point it must be fixed. It's way, way slower
than md raid.
At a certain point it needs to fall back to block-level copying, with a ~32KB
block. It can't be treating things as if they're 1K files, doing file-level
copying that takes forever. It's just too risky that another device fails in
the meantime.


There's device replace for restoring redundancy, which is fast, but not 
implemented yet for RAID5/6.


Now my colleague and I are implementing scrub/replace for RAID5/6,
and I have a plan to reimplement balance and split it off from the
metadata/file-data processing.  The main idea is:
- allocate a new chunk with the same size as the relocated one, but don't
  insert it into the block group list, so we don't allocate free space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with that of the new chunk
  (the new chunk has the same logical address and length as the old one)
- release the source chunk

This way, we needn't process the data one extent at a time, and needn't do
any space reservation, so the speed will be very fast even when we have
lots of snapshots.

Even if balance gets re-implemented this way, we should still provide 
some way to consolidate the data from multiple partially full chunks. 
Maybe keep the old balance path and have some option (maybe call it 
aggressive?) that turns it on instead of the new code.







Re: Heavy nocow'd VM image fragmentation

2014-10-27 Thread Austin S Hemmelgarn

On 2014-10-26 13:20, Larkin Lowrey wrote:

On 10/24/2014 10:28 PM, Duncan wrote:

Robert White posted on Fri, 24 Oct 2014 19:41:32 -0700 as excerpted:


On 10/24/2014 04:49 AM, Marc MERLIN wrote:

On Thu, Oct 23, 2014 at 06:04:43PM -0500, Larkin Lowrey wrote:

I have a 240GB VirtualBox vdi image that is showing heavy
fragmentation (filefrag). The file was created in a dir that was
chattr +C'd, the file was created via fallocate, and the contents of
the original image were copied into the file via dd. I verified that
the image was +C.

To be honest, I have the same problem, and it's vexing:

If I understand correctly, when you take a snapshot the file goes into
what I call 1COW mode.

Yes, but the OP said he hadn't snapshotted since creating the file, and
MM's a regular who actually wrote much of the wiki documentation on the
raid56 modes, so he'd better know about the snapshotting problem too.

So that can't be it.  There's apparently a bug in some recent code, and
it's not honoring the NOCOW even in normal operation, when it should be.

(FWIW I'm not running any VMs or large DBs here, so don't have nocow set
on anything and can and do use autodefrag on all my btrfs.  So I can't
say one way or the other, personally.)



Correct, there were no snapshots during VM usage when the fragmentation
occurred.

One unusual property of my setup is that I have my fs on top of bcache.
More specifically, the stack is md raid6 -> bcache -> lvm -> btrfs. When
the fs mounts it gets the 'ssd' mount option, because bcache sets
/sys/block/bcache0/queue/rotational to 0.

Is there any reason why either the 'ssd' mount option or being backed by
bcache could be responsible?



Two things:
First, regarding your question: the ssd mount option shouldn't be 
responsible for this, because it is only supposed to spread out 
allocation at the chunk level, not the block level, though some recent 
commit may have changed that.  Are you using any kind of compression in 
btrfs?  If so, then filefrag won't report the number of fragments 
correctly (it currently reports the number of compressed blocks in the 
file instead), and in that case I would expect the number of compressed 
blocks to go up as you use more space in the VM image: long runs of zero 
bytes compress well, other stuff (especially on-disk structures from 
encapsulated filesystems) doesn't.  You might also consider putting the 
VM images directly on the LVM layer instead; in my experience that tends 
to get much better performance than storing them on a filesystem.
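
A quick way to check both points (the image path is just an example):

    lsattr /srv/vm/disk.vdi        # C = NOCOW set, c = per-file compression
    grep btrfs /proc/mounts        # shows compress/compress-force and ssd options
    filefrag /srv/vm/disk.vdi      # extent count; with compression this counts
                                   # ~128KiB compressed extents, not real fragments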


Secondly, I'd recommend switching from bcache under LVM to dm-cache on 
top of LVM, as that makes it much easier to recover from the various 
failure modes, and also to deal with a corrupted cache, because dm-cache 
doesn't put any metadata on the backing device.  It takes longer to shut 
down when in write-back mode, and isn't SSD-optimized, but it has also 
been much more reliable in my experience.
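
Roughly, with the lvmcache tooling (the VG/LV names, sizes and SSD
device are made-up examples):

    # fast LVs on the SSD PV that will become the cache pool
    lvcreate -n cache0      -L 40G vg0 /dev/sdX1
    lvcreate -n cache0_meta -L  1G vg0 /dev/sdX1
    # combine them into a cache pool, then attach it to the existing origin LV
    lvconvert --type cache-pool --poolmetadata vg0/cache0_meta vg0/cache0
    lvconvert --type cache --cachepool vg0/cache0 vg0/data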






Re: btrfs deduplication and linux cache management

2014-10-30 Thread Austin S Hemmelgarn

On 2014-10-30 05:26, lu...@plaintext.sk wrote:

Hi,
I want to ask whether deduplicated file content will be cached in the linux
kernel just once for two deduplicated files.

To explain in depth:
  - I use btrfs for the whole system, with a few subvolumes and compression
on some of them.
  - I have two directories with the eclipse SDK with slight differences (same
version, different config)
  - I assume that the given directories are deduplicated, so the two eclipse
installations take up roughly as much space on the hdd as one would
  - I will start one of the given eclipses
  - the linux kernel will cache all files opened during the start of eclipse (I
have enough free ram)
  - I am just a happy stupid linux user:
     1. will the kernel cache file content after decompression? (I think yes)
     2. will cached data be in the VFS layer or in the block device layer?
  - When I launch the second eclipse (different from the first, but deduplicated
from the first) after the first one:
     1. will the second start require less data to be read from the HDD?
     2. will the metadata for the second instance be read from the hdd? (I assume yes)
     3. will the actual data be read a second time? (I hope not)

Thanks for answers,
have a nice day,


I don't know for certain, but here is how I understand things work in 
this case:
1. Individual blocks are cached in the block device layer, which means 
that the de-duplicated data would be cached at most as many times as 
there are disks it is on (i.e. at most once for a single-device 
filesystem, up to twice for a multi-device btrfs raid1 setup).
2. In the VFS layer, the cache handles decoded inodes (the actual file 
metadata), dentries (the file's entry in the parent directory), and 
individual pages of file content (after decompression).  AFAIK, the VFS 
layer's cache is pathname-based, so it would probably cache two copies 
of the data, but after the metadata look-up it wouldn't need to read 
from the disk because of the block layer cache.


Overall, this means that while de-duplicated data may be cached more 
than once, it shouldn't need to be reread from disk if there is still a 
copy in cache.  Metadata may or may not need to be read from the disk, 
depending on what is in the VFS cache.
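
One rough way to test this empirically (both install paths are invented
for the example, and dropping caches needs root):

    # drop the caches, then read one copy cold and the other warm
    sync && echo 3 > /proc/sys/vm/drop_caches
    time cat /opt/eclipse-a/plugins/*.jar > /dev/null   # cold: real disk reads
    time cat /opt/eclipse-b/plugins/*.jar > /dev/null   # mostly warm if dedup + block cache help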






Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Austin S Hemmelgarn

On 2014-11-18 02:29, Brendan Hide wrote:

Hey, guys

See further below extracted output from a daily scrub showing csum
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting
errors like this for a few days now.

The disk is patently unreliable, but smartctl's output implies there are
no issues. Is this somehow standard fare for S.M.A.R.T. output?

Here are (I think) the important bits of the smartctl output for
$(smartctl -a /dev/sdb) (the full results are attached):
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  100   253   006    Pre-fail Always   -           0
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always   -           1
  7 Seek_Error_Rate         0x000f  086   060   030    Pre-fail Always   -           440801014
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always   -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline  -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always   -           0
200 Multi_Zone_Error_Rate   0x      100   253   000    Old_age  Offline  -           0
202 Data_Address_Mark_Errs  0x0032  100   253   000    Old_age  Always   -           0



 Original Message 
Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) root@watricky
To: brendan@watricky



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
 scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
 total bytes scrubbed: 189.49GiB with 5420 errors
 error details: read=5 csum=5415
 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

In addition to the storage controller being a possibility, as mentioned 
in another reply, there are some parts of the drive that aren't covered 
by SMART attributes on most disks, most notably the on-drive cache. 
There really isn't a way to disable the read cache on the drive, but you 
can disable write-caching, which may improve things (and if it's a cheap 
disk, may provide better reliability for BTRFS as well).  The other 
thing I would suggest trying is a different data cable to the drive 
itself; I've had issues with some SATA cables (the cheap red ones you 
get in the retail packaging for some hard disks in particular) having 
either bad connectors or bad strain-reliefs, and failing after only a 
few hundred hours of use.
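
Disabling the drive's write cache is a one-liner with hdparm (using the
device name from your report):

    hdparm -W 0 /dev/sdb   # turn off the drive's volatile write cache
    hdparm -W /dev/sdb     # read back the current setting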






Re: btrfs send and an existing backup

2014-11-21 Thread Austin S Hemmelgarn

On 2014-11-20 09:10, Duncan wrote:

Bardur Arantsson posted on Thu, 20 Nov 2014 14:17:52 +0100 as excerpted:


If you have no other backups, I would really recommend that you *don't*
use btrfs for your backup, or at least have a *third* backup which isn't
on btrfs -- there are *still* problems with btrfs that can potentially
wreck your backup filesystem. (Although it's obviously less likely if
the external HDD will only be connected occasionally.)

Don't get me wrong, btrfs is becoming more and more stable, but I
wouldn't trust it with my *only* backup, especially if also running
btrfs on the backed-up filesystem.


This.

My working versions and first backups are btrfs.  My secondary backups
are reiserfs (my old filesystem of choice, which has been very reliable
for me), just in case both the btrfs versions bite the dust due to a bug
in btrfs itself.

Likewise, except I use compressed, encrypted tarballs stored on both 
Amazon S3 and Dropbox.
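
For what it's worth, a minimal sketch of that kind of backup (paths,
bucket name and cipher choice are just placeholders):

    # compressed, symmetrically encrypted tarball of /home
    tar -cJf - /home | gpg --symmetric --cipher-algo AES256 \
        -o /backups/home-$(date +%F).tar.xz.gpg
    # then upload with whichever client you prefer, e.g. the AWS CLI:
    # aws s3 cp /backups/home-2014-11-21.tar.xz.gpg s3://example-backup-bucket/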



