At 08/28/2016 11:38 AM, Oliver Freyermuth wrote:
Dear btrfs experts,
I just tried to make use of btrfs send / receive for incremental backups (using
btrbk to simplify the process).
It seems that on my two machines, btrfs send gets stuck after transferring some
GiB - it's not fully halted, but instead of making full use of the available I/O,
I get something < 500 kiB/s on average,
which are just some "full speed spikes" with many seconds / minutes of no I/O
in between.
During this "halting", btrfs send eats one full CPU core.
A "perf top" shows this is spent in "find_parent_nodes" and "__merge_refs"
inside the kernel.
I am using btrfs-progs 4.7 and kernel 4.7.0.
Known bug, but unfortunately there is no good idea for solving it yet.
I sent an RFC patch to completely disable shared extent detection, but it
met strong objection.
I also submitted some other ideas for fixing it, and those met strong
objection too. The objections include that this is a performance problem,
not a functional problem, and that we should focus on functional problems
first and postpone such performance problems.
Furthermore, Btrfs has from the beginning of its design focused on
fast snapshot creation, and sacrifices backref walk speed for it.
So it's not an easy thing to fix.
I googled a bit and found a related patch on patchwork
(https://patchwork.kernel.org/patch/9238987/) which seems to work around high
load in this area and mentions that a real solution is proposed but not yet there.
Since this affects two machines of mine and backing up my root volume would
take about 80 hours if I can extrapolate the average rate, this means
btrfs send is unusable to me.
Can I assume this is a common issue which will be fixed in a later kernel
release (4.8, 4.9) or can I do something to my filesystems to work around this issue?
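
As a rough check of the extrapolation above, here is a minimal Python sketch.
It assumes the "< 500 kiB" figure is a rate in KiB/s and treats it as an upper
bound; the volume size is not stated in the report, so the sketch only derives
what size an 80 hour transfer at that rate would imply. The function names are
made up for illustration.

```python
# Back-of-the-envelope estimate only. The 500 KiB/s rate is the upper bound
# quoted in the report (assumed to be per second); 80 h is the reporter's
# own extrapolation. The real volume size is not given in the thread.
KIB = 1024
GIB = 1024 ** 3


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a constant average rate."""
    return size_bytes / rate_bytes_per_s / 3600


def implied_size_gib(hours: float, rate_bytes_per_s: float) -> float:
    """Amount of data (GiB) moved in the given hours at a constant rate."""
    return hours * 3600 * rate_bytes_per_s / GIB


rate = 500 * KIB  # bytes per second
print(f"{implied_size_gib(80, rate):.1f} GiB")  # 137.3 GiB
```

So 80 hours at no more than 500 KiB/s corresponds to a send stream of at most
roughly 137 GiB.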
I don't expect there will even be an agreement on how to fix the problem
in v4.1x.
Fixes in send will lead to an obvious speed improvement, but cause
incompatibility or a super complex design.
Fixes in backref will lead to a backref rework, which normally comes
with new regressions, and we are not even sure it will really help.
If you just hate the super slow send, and can accept the extra space
usage, please try this RFC patch:
https://patchwork.kernel.org/patch/9245287/
This patch, just as its name says, completely stops same extent (reflink)
detection.
That causes more space usage, but since it skips the super time
consuming find_parent_nodes(), it should at least work around your problem.
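
The space trade-off described above can be illustrated with a toy model in
Python. This is not the kernel's send code or the send stream format, and all
names in it are made up. It also does not reproduce the cost: here the "is this
extent shared?" check is a cheap set lookup, whereas in the kernel it needs a
backref walk through find_parent_nodes(), which is the slow part.

```python
# Toy model only: NOT the kernel's send implementation or stream format.
from typing import Dict, List, Tuple

# Each file is a list of (extent_id, length_in_bytes). The same extent id
# appearing in two files models a reflink, e.g. after deduplication.
Files = Dict[str, List[Tuple[int, int]]]


def simulate_send(files: Files, detect_shared: bool) -> Tuple[int, int]:
    """Return (bytes_sent_as_data, clone_ops) for a simulated send."""
    seen = set()
    data_bytes = 0
    clone_ops = 0
    for name in sorted(files):
        for extent_id, length in files[name]:
            if detect_shared and extent_id in seen:
                clone_ops += 1        # receiver reflinks data it already has
            else:
                data_bytes += length  # data is sent (and stored) again
            seen.add(extent_id)
    return data_bytes, clone_ops


files = {
    "a": [(1, 4096), (2, 8192)],
    "b": [(1, 4096), (3, 4096)],  # extent 1 is shared with file "a"
}
print(simulate_send(files, detect_shared=True))   # (16384, 1)
print(simulate_send(files, detect_shared=False))  # (20480, 0)
```

Without detection, the shared extent is sent and stored twice on the receiving
side, which is the extra space usage mentioned above.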
I have some other, less aggressive ideas for fixing it, but since
there is objection against them, I didn't code them further.
But since there are *REAL* *WORLD* users reporting this problem, I
think I'd better restart the fix as an RFC.
Thanks,
Qu
One FS is only two weeks old, the other one now about 1 year. I did some
balancing at some points of time to have more unallocated space for trimming,
and used duperemove regularly to free space. One FS has skinny extents, the
other has not.
Mount options are "rw,noatime,compress=zlib,ssd,space_cache,commit=120".
Apart from that: No RAID or any other special configuration involved.
Cheers and any help appreciated,
Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html