On 12/13/2014 01:52 PM, Tomasz Chmielewski wrote:
On 2014-12-13 21:54, Robert White wrote:

- rsync many remote data sources (-a -H --inplace --partial) + snapshot

Using --inplace on a Copy On Write filesystem has only one effect, it
increases fragmentation... a lot...

...if the file was changed.

If the file hasn't changed then it won't be transferred by definition. So the un-changed file is not terribly interesting.

And I did think about the rest of (most of) your points right after sending the original email, particularly since I don't know your actual use case. But there is no "un-send", which I suddenly realized I wanted to do... because I needed to change my answer. Like ten seconds later. /sigh.

I'm still strongly against forcing compression.

That said, my knee-jerk reaction to using --inplace is still strong for almost all file types.

And it remains almost absolute in your case simply because you are finding yourself needing to balance and whatnot.

E.g. the theoretical model of efficient partial copies as you present it is fine... up until we get back to your original complaint about what a mess it makes.

The ruling precept here is Ben Franklin's "penny wise, pound foolish". What you _might_ be saving up-front with --inplace may be charging you double on the back-end in maintenance.

Every new block is going to get
written to a new area anyway,

Exactly - "every new block". But that's true with and without --inplace.
Also - without --inplace, it is "every block". In other words, without
--inplace, the file is likely to be rewritten by rsync to a new one, and
CoW is lost (more below).

I don't know the nature of the particular files you are translating, but I do know a lot about rsync and file layout in general for lots of different types of files.

(rsync details here for readers-along :: http://rsync.samba.org/how-rsync-works.html )

Now I am assuming you took this advice from something like the manual page: [QUOTE] This [--inplace] option is useful for transferring large files with block-based changes or appended data, and also on systems that are disk bound, not network bound. It can also help keep a copy-on-write filesystem snapshot from diverging the entire contents of a file that only has minor changes.[/QUOTE] Though maybe not, since the description goes on to say that --inplace implies --partial, so specifying both is redundant.
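For the record, the happy path the manual is describing looks something like this (paths made up for illustration):

# --inplace on the kind of file that actually qualifies: a raw VM
# image that only changes in scattered blocks
rsync -a --inplace /var/lib/libvirt/images/guest.raw backup:/srv/backup/images/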

But here's the thing: those files are really rare. Way more rare than you might think. They consist almost entirely of block-based database extents (like an Oracle tablespace file), logfiles (such as /var/log/messages), and VM disk image files (particularly raw images); ISO images that are _only_ modified by adding tracks may fall into this category as well.


So we've already skipped the unchanged files...

So, inserting a single byte into, or removing a single byte from, any file will cause a re-write from that point on. It will send that file from the block boundary containing that byte. Just about anything with a header and a history is going to get re-sent almost completely. This includes the output from any word processing program you are likely to encounter.

Anything with linear compression (such as Open Document Format, which is basically a ZIP file) will be resent entirely.

All compiled program binaries will be resent entirely if the program changed at all (the headers again, the changes in text segments, and the changes in layout that a single byte difference in size causes the ELF or DLL formats to juggle significantly).

And I could go on at length, but I'll skip that...
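If you want to watch it happen, here is a quick sketch (file names made up; note that rsync defaults to --whole-file for local copies, so you have to force the delta algorithm to see the effect):

# make a file and a copy of it
dd if=/dev/urandom of=demo bs=1M count=100
rsync -a demo demo2

# insert a single byte at the front, shifting everything down
{ printf 'X'; cat demo; } > demo.new && mv demo.new demo

# with --inplace the shifted blocks can't be reused (their old
# locations get overwritten as the file is written front-to-back),
# so "Literal data" in the stats comes out near the full 100MB
rsync -a --inplace --no-whole-file --stats demo demo2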

And _then_ the forced compression comes into play.

Rsync is going to impose its default block size to frame changes (see --block-size=) and then BTRFS is going to impose its compression frame sizes (presuming it is done by block size). If these are not exactly the same size, any rsync block that updates will result in one or two "extra" compression blocks being re-written by the tiling overlap effect.
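If you do insist on --inplace over compression, one hedge (and I am assuming here that BTRFS compresses in chunks of up to 128KiB) is to align rsync's block size to that boundary, so an updated block dirties one compression chunk instead of two:

# 131072 = 128KiB, which is also rsync's maximum block size
rsync -a --inplace --block-size=131072 /src/bigfile /dst/bigfile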

so if you have enough slack space to
keep the one new copy of the new file, which you will probably use up
anyway in the COW event, laying in the fresh copy in a likely more
contiguous way will tend to make things cleaner over time.

--inplace is doubly useless with compression as compression is
perturbed by default if one byte changes in the original file.

No. If you change 1 byte in a 100 MB file, or perhaps a 1 GB file, you
will likely lose a few kBs of CoW. The whole file is certainly not
rewritten if you use --inplace. However it will be wholly rewritten if
you don't use --inplace.


The only time --inplace might be helpful is if the file is NOCOW...
except...

No, you're wrong.
By default, rsync creates a new file if it detects any file modification
- like "touch file".

Consider this experiment:

# create a "large file"
dd if=/dev/urandom of=bigfile bs=1M count=3000

# copy it with rsync
rsync -a -v --progress bigfile bigfile2

# copy it again - blazing fast, no change
rsync -a -v --progress bigfile bigfile2

# "touch" the original file
touch bigfile

touching an unchanged file is cheating... and would be better addressed by the --checksum argument (unless you have something that really depends on the dates and you've already ensured that any restores won't mess up the dates later anyway). --checksum, of course, slows down the file selection process.
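That is (continuing your experiment):

# compare file contents instead of trusting size+mtime; the touched
# but unchanged file is left alone, at the cost of a full read
rsync -a --checksum bigfile bigfile2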


# try copying again with rsync - notice rsync creates a temp file, like
.bigfile2.J79ta2
# No change to the file except the timestamp, but good bye your CoW.
rsync -a -v --progress bigfile bigfile2

# Now try the same with --inplace; compare data written to disk with
iostat -m in both cases.
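(spelling that last comparison out, since we're trading experiments:

# the --inplace variant of the same run; note the write totals
rsync -a -v --progress --inplace bigfile bigfile2
iostat -m

...and compare the MB_wrtn column against the run without --inplace.)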


Same goes for append files - even if they are compressed, most CoW will
be shared. I'd say it will be similar for lightly modified files
(changed data will be CoW-unshared, some compressed "overhead" will be
unshared, but the rest will be untouched / shared by CoW between the
snapshots).

So while it is apparent that we both know how rsync works, I wonder if you've checked how much of your data load actually has a chance to benefit from --inplace and compared it to how much fragmentation it's likely to cause.

===

The basic problem I have with --inplace is, space permitting, you end up "better off" over long periods of time in a COW filesystem if you don't use it.

Consider any append-mode file. With each incremental append amidst a bulk transfer, each increment will tend to end up separated from the next by one or more raw allocation extents. That, if nothing else, will cause the extents to _never_ reach the empty state where they can be reclaimed automatically.

If we were making an infographic-style representation of your disk storage, the files with long histories would tend to look like lightning strikes "down the page" or vertical scribbling up-and-down instead of solid bars across the little chunks.
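You don't have to take the infographic on faith; filefrag will draw it for you in numbers (path made up):

# one output line per extent; an --inplace append file accumulates
# one far-flung extent per sync
filefrag -v /backup/var/log/messages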

As snapshots come and go, the copied-and-replaced little bars would be memorialized for a time and then go away. The up and down result is semi-permanent, requiring you to do internal maintenance nonsense to try to coalesce the scribbles into bars.

So every time you defrag and balance the drive you are just taking steps to undo the geographical harm that rsync --inplace caused in the first place. That drives the total effective write cost of --inplace up to the full copy cost _plus_ the incremental copy-on-overwrite cost, spread over repeated activities (defrags, balances, etc.).
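And the maintenance in question is not free either (mountpoint made up; note that defragment on BTRFS also un-shares extents that snapshots were sharing, so it costs you space as well as writes):

# re-pack the scribbles into bars, then re-pack the allocation groups
btrfs filesystem defragment -r /backup
btrfs balance start -dusage=50 /backup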

Without --inplace you'd definitely be using up more room for the copies (at least until internal data de-duplication comes along, presuming it does), but those copies will go away as the snapshots age off leaving larger chunks available for future allocation. The result of that will be smaller trees (not that that matters) and larger gaps (which really does matter) that the system can work with more optimally as time rolls forward forever.

In computer science (and other disciplines) "The Principle(s) of Locality" are huge actors. It's the basis of caching at all levels and it features strongly in graph theory and simple mechanics.

So yea, ten snapshots of a 30GiB VM disk image that hardly changed at all would _suck_, and might be worth its own selective rsync for the subdirectories where such things might happen; but turning a backup copy of your browser history file into a fifty-segment snail-trail wandering all through your data extents is not to be taken lightly either.

The middle ground is to selectively rsync the (usually) very-few directories that contain the files you _know_ will explicitly benefit from --inplace, such as /home/some_user/Virtual_Machines/*; then rsync the whole tree without the option. [The already synchronized directory will be automatically seen as current and you'll get optimal results.]
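In command form, something like (paths made up):

# pass one: just the directories that benefit from --inplace
rsync -a --inplace /home/some_user/Virtual_Machines/ backup:/srv/home/some_user/Virtual_Machines/

# pass two: the whole tree without it; the VM directory is already
# up to date, so it is skipped
rsync -a /home/some_user/ backup:/srv/home/some_user/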


ASIDE :: I hope you are also using --delete when you rsync your backups. 8-)