On 2017-01-16 10:42, Christoph Groth wrote:
Austin S. Hemmelgarn wrote:
On 2017-01-16 06:10, Christoph Groth wrote:
root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
Just a general comment on this: you might want to consider running a
full balance on this filesystem. You've got a huge amount of slack
space in the data chunks (over 70GiB), significant space in the
Metadata chunks that isn't accounted for by the GlobalReserve, and a
handful of empty single-profile chunks which are artifacts from
some old versions of mkfs. This isn't essential, of course, but
keeping ahead of such things does help sometimes when you have issues.
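For reference, a full balance is simply a balance with no filters; newer
btrfs-progs versions warn and pause if you request one that way, which
--full-balance skips. The mount point here is only an example:

btrfs balance start --full-balance /
# progress can be checked from another shell with:
btrfs balance status /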
Thanks! So slack is the difference between "total" and "used"? I saw
that the manpage of "btrfs balance" explains this a bit in its
"examples" section. Are you aware of any more in-depth documentation,
or does one have to look at the source at this level?
There's not really much in the way of great documentation that I know
of. I can however cover the basics here:
BTRFS uses a two-level allocation system. At the higher level, you have
chunks. These are just big blocks of space on the disk that get used
for only one type of lower-level allocation (Data, Metadata, or System).
Data chunks are normally 1GB, Metadata chunks 256MB, and System chunks
depend on the size of the FS when it was created. Within these chunks,
BTRFS then allocates individual blocks just like any other filesystem.
When there is no free space in any existing chunk for a new block that
needs to be allocated, a new chunk is allocated. Newly allocated chunks
may be larger (if the filesystem is really big) or smaller (if the FS
doesn't have much free space left at the chunk level) than the default.
In the event that BTRFS can't allocate a new chunk because there's no
room, a couple of different things can happen. If the chunk to be
allocated was a data chunk, you get -ENOSPC (usually; sometimes you
might get other odd results) in the userspace application that
triggered the allocation. However, if BTRFS needs room for metadata,
it will try to use the GlobalReserve instead. This is a special area
within the metadata chunks that's reserved for internal operations and
for trying to get out of free-space-exhaustion situations. If that
fails, the filesystem is functionally dead: reads will still work, and
you might be able to write very small amounts of data at a time, but
from a practical perspective it's not possible to recover a filesystem
in such a situation.
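If you want to see the chunk-level picture directly, recent btrfs-progs
can show allocated vs. unallocated space per device alongside the
per-profile breakdown (the mount point is just an example):

# 'Device unallocated' is the space still available for new chunks;
# once it hits zero, the failure modes described above apply.
btrfs filesystem usage /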
The 'total' value in 'btrfs fi df' output is the total space allocated
to chunks of that type, while the 'used' value is how much of that space
is actually being used. It's worth noting that since the GlobalReserve
is part of the Metadata chunks, its total is included in the total for
Metadata, but not in the used value (so in an ideal situation with no
slack space at the block level, you would still see a difference between
Metadata total and used equal to the GlobalReserve total).
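As a concrete illustration using the output quoted above: the Data slack
is 417.00GiB - 344.62GiB ≈ 72GiB, which is where the "over 70GiB" figure
earlier in the thread comes from. If you want to compute it automatically,
something like the following sketch works, assuming the -b (raw bytes)
output format of current btrfs-progs:

btrfs fi df -b / | awk -F'[=, ]+' \
  '/total=/ { printf "%-22s slack = %.2f GiB\n", $1" "$2, ($4 - $6) / 2^30 }'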
What balancing does is send everything back through the allocator, which
in turn back-fills chunks that are only partially full and removes ones
that are now empty. In normal usage it's not absolutely needed. From
a practical perspective, though, it's generally a good idea to keep the
slack space (the difference between total and used) within chunks to a
minimum, to avoid getting the filesystem stuck with no free space at
the chunk level.
I ran
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs balance start -dusage=25 -musage=25 /
This resulted in
root@mim:~# btrfs fi df /
Data, RAID1: total=365.00GiB, used=344.61GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=2.00GiB, used=1.35GiB
GlobalReserve, single: total=460.00MiB, used=0.00B
This is a much saner-looking FS: you've only got about 20GB of slack in
the Data chunks and less than 1GB in Metadata, which is reasonable given
the size of the FS and how much data you have on it. Ideal values for
both are hard to pin down, as having no slack at all in the chunks
actually hurts performance a bit, and the right amount depends on how
much your workloads hit each type of chunk.
I hope that one day there will be a daemon that silently performs all
the necessary btrfs maintenance in the background when system load is low!
FWIW, while there isn't a daemon yet that does this, it's a perfect
thing for a cronjob. The general maintenance regimen that I use for
most of my filesystems is:
* Run 'btrfs balance start -dusage=20 -musage=20' daily. This will
complete really fast on most filesystems and keeps the slack space
relatively under control (and has the nice bonus that it helps
defragment free space).
* Run a full scrub on all filesystems weekly. This catches silent
corruption of the data, and will fix it if possible.
* Run a full defrag on all filesystems monthly. This should be run
before the balance (the reasons are complicated and require more
explanation than you probably care for). I would run this at least
weekly on HDDs though, as they tend to be more negatively impacted by
fragmentation.
There are a couple of other things I also do (fstrim, and punching holes
in large files to make them sparse), but they're not really BTRFS
specific. Overall, with a decent SSD (I usually use Crucial MX series
SSDs in my personal systems), these have near-zero impact most of the
time, and with decent HDDs you should have limited issues as long as
you run these tasks on only one FS at a time.
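For what it's worth, this regimen maps fairly directly onto cron. The
entries below are only a sketch (the schedule, the mount point and
discarding output are assumptions you'd adapt), not an exact copy of
what I run:

# /etc/cron.d/btrfs-maintenance (sketch)
# Daily: light balance to reclaim mostly-empty chunks.
30 3 * * *  root  btrfs balance start -dusage=20 -musage=20 / >/dev/null
# Weekly: full scrub to catch (and where possible repair) silent corruption.
0 4 * * 0   root  btrfs scrub start -Bd / >/dev/null
# Monthly: recursive defrag, scheduled so it runs before that day's balance.
0 2 1 * *   root  btrfs filesystem defragment -r / >/dev/null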
* So scrubbing is not enough to check the health of a btrfs file
system? It’s also necessary to read all the files?
Scrubbing checks data integrity, but not the structural state of the
filesystem. IOW, you're checking that the data and metadata match their
checksums, but not necessarily that the filesystem itself is valid.
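If you do want a structural check as well, that's what the offline
checker is for; run it read-only against an unmounted filesystem (the
device name here is just an example):

btrfs check --readonly /dev/sdb1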
I see, but what should one then do to detect problems such as mine as
soon as possible? Periodically calculate hashes for all files? I’ve
never seen a recommendation to do that for btrfs.
Scrub will verify that the data is the same as when the kernel
calculated the block checksum. That's really the best that can be done.
In your case, it couldn't correct the errors because both copies of
the corrupted blocks were bad (this points at an issue with either RAM
or the storage controller BTW, not the disks themselves). Had one of
the copies been valid, it would have intelligently detected which one
was bad and fixed things. It's worth noting that the combination of
checksumming and scrub actually provides more stringent data integrity
guarantees than any other widely used filesystem except ZFS.
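For reference, after a scrub you can see what it found and fixed per
device with the following (the mount point is an example; the exact
wording of the output depends on the btrfs-progs version):

btrfs scrub status -d /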
As far as general monitoring goes, in addition to scrubbing (and
obviously watching SMART status) you want to check the output of 'btrfs
device stats' for non-zero error counters (these are cumulative counters
that are only reset when the user says to do so, so right now they'll
show aggregate data for the life of the FS). If you're paranoid, also
watch that the mount options on the FS don't change (some monitoring
software, such as Monit, makes this insanely easy to do), as the FS will
go read-only if a severe error is detected (stuff like a failed read at
the device level, not just checksum errors).
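A minimal sketch of such a check, suitable for running from cron (cron
will mail any output; the mount point is only an example):

#!/bin/sh
FS=/srv/data   # example mount point

# Print any non-zero btrfs error counters (these are cumulative until
# reset with 'btrfs device stats -z').
btrfs device stats "$FS" | grep -v ' 0$'

# Flag a filesystem that has gone read-only after a severe error.
findmnt -n -o OPTIONS "$FS" | grep -qw ro && echo "$FS is mounted read-only"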
There are a few things you can do to mitigate the risk of not using
ECC RAM though:
* Reboot regularly, at least weekly, and possibly more frequently.
* Keep the system cool; warmer components are more likely to have
transient errors.
* Prefer fewer memory modules when possible. Fewer modules means less
total area that could be hit by cosmic rays or other high-energy
radiation (the main cause of most transient errors).
Thanks for the advice; I think I buy the regular reboots.
As a consequence of my problem I think I’ll stop using RAID1 on the file
server, since this only protects against dead disks, which evidently is
only part of the problem. Instead, I’ll make sure that the laptop that
syncs with the server has an SSD that is big enough to hold all the data
that is on the server as well (1 TB SSDs are affordable now). This way,
instead of disk-level redundancy, I’ll have machine-level redundancy.
When something like the current problem hits one of the two machines, I
should still have a usable second machine with all the data on it.
I actually have a similar situation: I've got a laptop that I back up to
a personal server system. In my case, though, I've taken a much
higher-level approach: the backup storage is in fact GlusterFS (a
clustered filesystem) running on top of BTRFS on 3 different systems
(the server, plus a pair of Intel NUCs that are just dedicated SAN
systems). If I didn't have the hardware to do this, or cared more about
performance (I'm lucky if I get 20MB/s write speed, but most of the
issue is that I went cheap on the NUCs), I would probably still be
using BTRFS in raid1 mode on the server despite keeping a copy on the
laptop, simply because that provides an extra layer of protection on the
server side.