Re: Feature requests: online backup - defrag - change RAID level

General Zed Thu, 12 Sep 2019 22:06:28 -0700


Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:


Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

> Don't forget you have to write new checksum and free space tree pages.
> In the worst case, you'll need about 1GB of new metadata pages for each
> 128MB you defrag (though you get to delete 99.5% of them immediately
> after).

Yes, here we are debating some worst-case scenario which is actually
impossible in practice due to various reasons.


No, it's quite possible.  A log file written slowly on an active
filesystem above a few TB will do that accidentally.  Every now and then
I hit that case.  It can take several hours to do a logrotate on spinning
arrays because of all the metadata fetches and updates associated with
worst-case file delete.  Long enough to watch the delete happen, and
even follow along in the source code.

I guess if I did a proactive defrag every few hours, it might take less
time to do the logrotate, but that would mean spreading out all the
seeky IO load during the day instead of getting it all done at night.
Logrotate does the same job as defrag in this case (replacing a file in
thousands of fragments spread across the disk with a few large fragments
close together), except logrotate gets better compression.

To be more accurate, the example I gave above is the worst case you
can expect from normal user workloads.  If I throw in some reflinks
and snapshots, I can make it arbitrarily worse, until the entire disk
is consumed by the metadata update of a single extent defrag.


I can't believe I am considering this case.

So, we have a 1 TB log file "ultralog" split into 256 million 4 KB
extents randomly over the entire disk. We have 512 GB free RAM and 2%
free disk space. The file needs to be defragmented.

In order to do that, defrag needs to be able to copy-move multiple
extents in one batch, and update the metadata.

The metadata has a total of at least 256 million entries, each of some
size, but each one should hold at least a pointer to the extent (8
bytes) and a checksum (8 bytes). In reality, there could be a lot of
other data per entry.
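
As a rough check of the numbers above, here is a small sketch. It
assumes binary units (1 TB = 2**40 bytes, 4 KB = 2**12 bytes) and uses
only the 16-byte lower bound per entry named in this message, not the
real on-disk item size, which is certainly larger.

```python
# Back-of-the-envelope sizes for the "ultralog" example.
# Assumption: binary units; BYTES_PER_ENTRY is only the lower bound
# (extent pointer + checksum) stated in the message.
FILE_SIZE = 2**40            # 1 TB log file
EXTENT_SIZE = 4 * 2**10      # 4 KB per extent
BYTES_PER_ENTRY = 8 + 8      # pointer + checksum

extent_count = FILE_SIZE // EXTENT_SIZE
min_metadata_bytes = extent_count * BYTES_PER_ENTRY

print(extent_count)                   # number of extents (256 Mi)
print(min_metadata_bytes / 2**30)     # lower bound on metadata, in GiB
```

So even with the most optimistic entry size, the metadata for this one
file is several GiB, which is why it has to be processed in pieces.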

The metadata is organized as a b-tree. Therefore, nearby nodes should
contain data of consecutive file extents.

The trick, in this case, is to select one part of "ultralog" which is
localized in the metadata, and defragment it. Repeating this step will
ultimately defragment the entire file.

So, the defrag selects some part of metadata which is entirely a
descendant of some b-tree node not far from the bottom of the b-tree. It
selects it such that the required update to the metadata is less than,
let's say, 64 MB, and simultaneously the affected "ultralog" file
fragments total less than 512 MB (therefore, less than 128 thousand
metadata leaf entries, each pointing to a 4 KB fragment). Then it
finds all the file extents pointed to by that part of metadata. They
are consecutive (as file fragments), because we have selected such a
part of metadata. Now the defrag can safely copy-move those fragments
to a new area and update the metadata.
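
The batching step can be sketched roughly as follows. This is a toy
model, not btrfs code: the extent map is a plain dict from file offset
to disk location standing in for the b-tree leaves, and the names
plan_batches, defrag, max_data_bytes and max_meta_entries are made up
for illustration. It ignores checksums, free space tracking, reflinks
and snapshots entirely.

```python
import random

EXTENT_SIZE = 4 * 1024  # every fragment is one 4 KB extent in this model

def plan_batches(extent_map, max_data_bytes, max_meta_entries):
    """Yield lists of file offsets, consecutive in file order, where each
    list stays within both the data limit and the metadata entry limit."""
    batch = []
    for offset in sorted(extent_map):
        too_much_data = (len(batch) + 1) * EXTENT_SIZE > max_data_bytes
        too_many_entries = len(batch) + 1 > max_meta_entries
        if batch and (too_much_data or too_many_entries):
            yield batch
            batch = []
        batch.append(offset)
    if batch:
        yield batch

def defrag(extent_map, free_start, max_data_bytes, max_meta_entries):
    """Relocate the file batch by batch into a contiguous free area."""
    next_free = free_start
    for batch in plan_batches(extent_map, max_data_bytes, max_meta_entries):
        for offset in batch:
            # "copy-move" the fragment, then update its metadata entry
            extent_map[offset] = next_free
            next_free += EXTENT_SIZE
    return extent_map

# Toy usage: 1000 fragments scattered at random over the disk.
random.seed(0)
locations = random.sample(range(0, 10**9, EXTENT_SIZE), 1000)
ultralog = {i * EXTENT_SIZE: loc for i, loc in enumerate(locations)}
defrag(ultralog, free_start=10**12,
       max_data_bytes=64 * EXTENT_SIZE, max_meta_entries=100)
```

With the limits from the text (512 MB of data, 128 thousand entries)
each batch would cover one localized run of leaf entries, and the loop
simply repeats until the whole file has been moved.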

In order to quickly select that small part of metadata, the defrag
needs a metadata cache that can hold somewhat more than 128 thousand
localized metadata leaf entries. That fits into 128 MB RAM definitely.
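
To see how much room that leaves per entry, here is a quick
calculation, again assuming binary units. The resulting per-entry
budget is only what 128 MB allows, not a measured btrfs item size.

```python
# How large may a cached leaf entry be if one batch must fit in 128 MB?
CACHE_BYTES = 128 * 2**20        # proposed metadata cache
BATCH_DATA_BYTES = 512 * 2**20   # data covered by one batch
EXTENT_SIZE = 4 * 2**10          # 4 KB per fragment

entries_per_batch = BATCH_DATA_BYTES // EXTENT_SIZE
budget_per_entry = CACHE_BYTES // entries_per_batch

print(entries_per_batch)   # leaf entries in one batch (128 Ki)
print(budget_per_entry)    # bytes available per cached entry
```

That is about 1 KB per entry, far more than the 16-byte lower bound, so
the cache has plenty of slack for whatever else each entry carries.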

Of course, there are many other small issues there, but this outlines
the general procedure.


Problem solved?
