On Tue, Feb 3, 2026 at 7:47 PM Gordan Bobic via discuss
<[email protected]> wrote:
> But I guess if you could quickly block clone everything and
> mariabackup is aware of it, then that would minimize the backup window
> during which the redo log is at risk of overflowing.

The current circular InnoDB WAL (ib_logfile0) would make
block-cloning a little tricky. If we could block all writes to the
file for a short time, then I think it could work.

In the new innodb_log_archive format, InnoDB would allocate a new log
file each time the current one is filling up. When the first
checkpoint is written to the new file, the old file would be made
read-only, to signal to any tools that it is safe to hard-link it.
The last file is also safe to hard-link at any time, because its log
records will never be overwritten. However, a hard link of the last
(actively written) file would not be safe for starting up a new
server, because we must not allow both the old and the new server to
write to the same log file. That is why the last log file would have
to be copied or block-cloned as a final step of the backup (without
blocking the server). https://jira.mariadb.org/browse/MDEV-37949
mentions a possible new parameter innodb_log_recovery_target, which
would allow any extra writes in the last log file to be ignored.
Alternatively, we could invalidate the tail of the last log file by
writing at least one NUL byte at the desired end position. The new
server would then start writing from that LSN onwards.
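
To illustrate, here is a minimal sketch of how a backup tool could
treat such archived log files, assuming that a finished file is
recognizable by its cleared write permission bits. The file names and
the helper function are hypothetical, not an existing interface:

#include <sys/stat.h>
#include <unistd.h>

/* Hard-link a finished (read-only) archived log file into the backup
   directory. The last, actively written log file would instead be
   copied or block-cloned as the final step of the backup. */
static int archive_finished_log(const char *src, const char *dst)
{
  struct stat st;
  if (stat(src, &st))
    return -1;
  if (st.st_mode & (S_IWUSR | S_IWGRP | S_IWOTH))
    return 1; /* still being written to; handle it at the end */
  return link(src, dst); /* safe: the contents will never change */
}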

> It has been a long time since I looked at btrfs, but I seem to vaguely
> recall that its incrementals still involve reading the entire old and
> new files to compute the delta, which is very inefficient,
> particularly with databases where updating a single row means having
> to re-read the entire tablespace.
> ZFS is significantly more advanced than that and only has to read and
> send the blocks that have actually changed.

Thank you, this is very useful. Your description of incremental btrfs
transfer resembles how mariadb-backup --backup --incremental
currently works: it really does read all *.ibd files to find out
which pages have changed.

With the innodb_log_archive format, you would basically only copy the
log that was written after the previous (full or incremental) backup
finished, and it would cover all changes to InnoDB files. This would
be analogous to the incremental ZFS snapshot transfer. However, the
binlog, the .frm files and the files of any other storage engines
would still have to be handled separately, until and unless an option
is implemented to persist everything via a single log.
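
As a rough sketch of that idea (assuming a hypothetical numbering
scheme ib_logfile<N> for the archived log files; nothing like this is
implemented yet), an incremental backup would only need to consider
files whose sequence number is at least that of the last file
included in the previous backup:

#include <stdio.h>

/* Check whether an archived log file belongs in an incremental
   backup. The last file of the previous backup was still being
   written to back then, so it must be included again. */
static int needs_incremental_copy(const char *name,
                                  unsigned long last_in_prev_backup)
{
  unsigned long seq;
  if (sscanf(name, "ib_logfile%lu", &seq) != 1)
    return 0; /* not an archived log file */
  return seq >= last_in_prev_backup;
}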

> > I have also been thinking of implementing a live streaming backup in
> > the tar format. Perhaps, for performance reasons, there should be an
> > option to create multiple streams in parallel. I am yet to experiment
> > with this.
>
> I don't think tar can do that, which is why there is no such thing as
> a parallel tar.

Above, I was thinking of an option to split the content into multiple
tar streams, which could be processed in parallel.

> And tar can actually be a serious single-threaded bottleneck when you
> are using NVMe drives and 10G+ networking.

Can you think of anything that would allow efficient streaming using a
single TCP/IP connection in this kind of environment?

> And the only real tunable only shifts it by about 33% (on x86-64 -
> other platforms may be different):
> https://shatteredsilicon.net/tuning-tar/
> And 33% doesn't really move the needle enough for large fast servers
> that run databases 10s of terabytes in size.

As demonstrated in https://jira.mariadb.org/browse/MDEV-38362, some
more performance could be squeezed by using the Linux system calls
sendfile(2) or splice(2). Unfortunately, both system calls are limited
to copying 65536 bytes at a time. Such offloading is possible with the
tar format, because there is no CRC on the data payload, only on the
metadata.
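
To make the idea concrete, here is a minimal sketch of how the data
payload of one tar archive member could be streamed with sendfile(2),
65536 bytes at a time. The function name is made up, and the tar
header and the 512-byte padding are assumed to be written separately
through the normal buffered path:

#include <sys/sendfile.h>
#include <unistd.h>

/* Stream "size" bytes of file data from in_fd to out_fd (for example,
   a pipe or a TCP socket) without copying them through user space. */
static int send_payload(int out_fd, int in_fd, off_t size)
{
  off_t offset = 0;
  while (offset < size)
  {
    ssize_t n = sendfile(out_fd, in_fd, &offset,
                         size - offset < 65536 ? size - offset : 65536);
    if (n <= 0)
      return -1; /* a real implementation would retry or fall back */
  }
  return 0;
}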

I fear that we may need multiple streams, which would complicate the
interface. The simplest that I can come up with would be to specify
the number of streams as well as the name of a script:

BACKUP SERVER WITH 8 CLIENT '/path/to/my_script';

The above would reuse existing reserved words. The specified script
would be invoked with a unique parameter (the stream number) and
could look something like the following:

#!/bin/sh
# $1 is the stream number; user@backuphost is a placeholder destination.
zstd | ssh user@backuphost "cat > $1.tar.zstd"

This kind of format would allow full flexibility for any further
processing. For example, you could extract multiple streams in
parallel if you have fast storage:

for i in *.tar.zstd; do tar --zstd -xf "$i" -C /data & done

Streaming backup is something that I plan to work on after
implementing a BACKUP SERVER command that targets a mounted file
system. For that, I plan to primarily leverage the Linux
copy_file_range(2), which can copy (or block-clone) up to 2 gigabytes
per call, falling back to sendfile(2) and ultimately pread(2) and
write(2).
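
A minimal sketch of that fallback chain could look as follows; the
function name and the error handling are simplified and not taken
from any actual implementation:

#define _GNU_SOURCE
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy (or block-clone) "size" bytes from in_fd to out_fd at the
   current file positions: try copy_file_range(2) first, then
   sendfile(2), and finally plain read(2) and write(2). The real code
   would rather use pread(2) with explicit offsets. */
static int copy_or_clone(int out_fd, int in_fd, off_t size)
{
  while (size > 0)
  {
    ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, size, 0);
    if (n <= 0)
      break; /* e.g. EXDEV or EOPNOTSUPP: fall back to sendfile(2) */
    size -= n;
  }
  while (size > 0)
  {
    ssize_t n = sendfile(out_fd, in_fd, NULL, size);
    if (n <= 0)
      break; /* fall back to the portable path */
    size -= n;
  }
  while (size > 0)
  {
    char buf[65536];
    ssize_t n = read(in_fd, buf,
                     size < (off_t) sizeof buf ? (size_t) size : sizeof buf);
    if (n <= 0 || write(out_fd, buf, n) != n)
      return -1;
    size -= n;
  }
  return 0;
}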

With best regards,

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc