Hi Peter,

Le 07/02/2017 à 15:13, Peter Zaitsev a écrit :
> Hi Hugo,
>
> For the use case I'm looking at, I'm interested in having snapshot(s)
> open at all times. Imagine, for example, a snapshot being created every
> hour and several of these snapshots kept at all times, providing quick
> recovery points to the state of 1, 2, 3 hours ago. In such a case (as I
> think you also describe) nodatacow does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is
> there any autodefrag documentation available about how it is expected
> to work and whether it can be tuned in any way?

There's not much that can be done if the same file is modified in two
different subvolumes (typically the original and an R/W snapshot). You
either break the reflink around the modification to limit the amount of
fragmentation (which costs disk space and write I/O) or accept
fragmentation on at least one subvolume (which adds seeks).
So the only options are either to flatten the files (which can be done
incrementally by defragmenting them on both sides when they change) or
to defragment only the most used subvolume (especially if the other is a
relatively short-lived snapshot whose performance won't degrade much
before it is removed, so the fragmentation won't matter much).
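
For reference, flattening by hand boils down to a plain recursive
defragment; note that on current kernels this breaks reflinks with
existing snapshots, hence the space cost mentioned above (the path below
is only an example):

    # defragment one subvolume recursively (add -t <size> to target larger extents)
    btrfs filesystem defragment -r /mnt/data/my-subvolume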

I just modified our defragmentation scheduler to be aware of multiple
subvolumes and to support ignoring some of them. The previous version
(not tagged, sorry) was battle-tested on a Ceph cluster and designed for
it. Autodefrag didn't work for our Ceph workload (latency went through
the roof, OSDs were timing out requests, ...) while our scheduler, with
some simple Ceph/BTRFS-related tunings, gave us even better performance
than XFS (which is usually the recommended choice with current Ceph
versions).

The current version is probably still rough around the edges as it is
brand new (most of the work was done last Sunday) and only running on a
backup server with a situation not much different from yours: a large
PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a
daily snapshot used to start a PostgreSQL instance for "tests on real
data" purposes, plus a copy of a <10TB NFS server with similar snapshots
in place. All of this is on a single 13-14TB RAID10 BTRFS filesystem.
In our case, using autodefrag on this setup slowly degraded performance
to the point where off-site backups became slow enough to warrant
preventive measures.
The current scheduler looks for the mountpoints of top BTRFS volumes (so
you have to mount the top volume somewhere, see the example below), and
defragments them while avoiding:
- read-only snapshots,
- all data below configurable subdirectories (including read-write
subvolumes, even if they are mounted elsewhere); see README.md for
instructions.
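
If you don't already have the top volume mounted, something like this
should do (device and mountpoint are only examples; subvolid=5 is the
BTRFS top-level subvolume):

    mkdir -p /mnt/btrfs-top
    mount -o subvolid=5 /dev/sdX /mnt/btrfs-top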

It slowly walks all files eligible for defragmentation and in parallel
detects writes to the same filesystem (including writes to read-write
subvolumes mounted elsewhere) to trigger defragmentation. The scheduler
uses an estimated "cost" for each file to prioritize defragmentation
tasks and, with default settings, tries to keep I/O activity low enough
that it doesn't slow down other tasks too much. However, it defragments
files whole, which might put some strain on the system with huge ibdata*
files if you didn't switch to file-per-table. In our case defragmenting
1GB files is OK and doesn't have a major impact.
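
Side note: if you are still on a single big ibdata file, the standard
way to get per-table files for new tables is the usual InnoDB setting
(shown only as a reminder; existing tables need to be rebuilt to
benefit):

    # my.cnf
    [mysqld]
    innodb_file_per_table = 1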

We are already seeing better performance (our total daily backup time is
below worrying levels again) and the scheduler hasn't even finished
walking the whole filesystem (there are approximately 8 million files
and it is configured to evaluate them over a week). This is probably
because it follows the most write-active files (which are in the
PostgreSQL slave directory) and defragmented most of them early.

Note that it is tuned for filesystems using ~2TB 7200rpm drives (there
are some options that will adapt it to storage with more I/O capacity).
Using drives with different capacities shouldn't need tuning, but it
probably will not work well on SSDs (it would need to be configured to
run significantly faster).

See https://github.com/jtek/ceph-utils; you want btrfs-defrag-scheduler.rb.

Some parameters are available (start it with --help). You should
probably run it with --verbose, at least until you are comfortable with
it: this lists which files are defragmented, along with many debug
messages you probably want to ignore (you would have to read the Ruby
code to fully understand what they mean).
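
For illustration, an invocation could look like this (the exact way you
start it may differ, check --help):

    ruby btrfs-defrag-scheduler.rb --verbose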

I don't provide any warranty for it, but the worst that I believe can
happen is no performance improvement, or performance degradation until
you stop it. If you don't blacklist read-write snapshots with the
.no-defrag file (see README.md), defragmentation will probably eat more
disk space than usual. Space usage will go up rapidly during
defragmentation if you have snapshots; it is supposed to go down after
all snapshots referring to the fragmented files are removed and replaced
by new snapshots (where fragmentation should be more stable).
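
Blacklisting should be as simple as creating the marker file (check
README.md for exactly where it must go); the path below is only an
illustration:

    touch /mnt/btrfs-top/snapshots/test-instance/.no-defrag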

Best regards,

Lionel