Re: btrfs autodefrag?

2015-10-19 Thread Erkki Seppala
Hugo Mills  writes:
>It has to be disabled because if you enable it, there's a race
> condition: since you're overwriting existing data (rather than CoWing
> it), you can't update the checksums atomically. So, in the interests
> of consistency, checksums are disabled.

I suppose this has been suggested before, but couldn't it store both the
new and the old checksums and be satisfied if either of them match?

The user is probably not happy that a partial write is going to be
difficult to read from the device due to a checksum error, but there is
no promise of recently-overwritten data state with traditional
filesystems either in case of sudden powerdown, assuming there is no
data journaling..

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/



Re: btrfs autodefrag?

2015-10-19 Thread Paul Harvey
On 18 October 2015 at 16:46, Duncan <1i5t5.dun...@cox.net> wrote:
> Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:
>
>> Hi,
>>
>> On a desktop equipped with an ssd with one 100GB virtual image used
>> frequently, what do you recommend?
>> 1) nothing special, it is all fine as long as you have a recent kernel
>> (which I do)
>> 2) Disabling copy-on-write for just the VM image directory.
>> 3) autodefrag as a mount option.
>> 4) something else.
>>
>> I don't think this usecase is well documented therefore I asked this
>> question.

[snip]

> So ssd or spinning rust, there's serious conflicts between nocow and
> snapshotting that really must be taken into consideration if you're
> planning to both snapshot and nocow.

This is all spot-on advice, but I just wanted to chime in to mention
what I've been experimenting with:
- The active working copies of the VM image files are hosted on
non-btrfs filesystems.
- A regularly scheduled rsync --inplace updates a btrfs subvol copy of
each file, and that copy *is* snapshotted and part of regular
send/receive runs.

rsync --inplace does what it says on the tin: it rewrites only those
parts of a file which need to be updated. Thus the btrfs copy only gets
written to once prior to each snapshot run, rather than continuously.
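
Roughly, one iteration of that cycle looks something like the following
(the paths and snapshot naming are only placeholders here;
--no-whole-file is needed because rsync defaults to whole-file copies
when both paths are local):

  # refresh the btrfs-side copy, rewriting only the changed regions in place
  rsync --inplace --no-whole-file /fast/vms/disk.img /mnt/pool/vms/disk.img

  # then take the read-only snapshot that the send/receive runs work from
  btrfs subvolume snapshot -r /mnt/pool/vms /mnt/pool/snapshots/vms.$(date +%Y%m%d)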

So the theory is that I can retain CoW storage efficiency (hold lots
of snapshots cheaply) but still keep decent performance (by running
the active, in-use working copies outside of my normal snapshotted
btrfs filesystems).

The cost is obviously more filesystems than you'd normally have to
run, more complex disaster recovery, not to mention that storage sizing
has to accommodate a working copy on a separate fs from the archived
copies. Plus, this rsync approach has noticeably bigger I/O overhead
than btrfs send/receive, although in my environment nobody is noticing.


Re: btrfs autodefrag?

2015-10-19 Thread Austin S Hemmelgarn

On 2015-10-19 02:19, Erkki Seppala wrote:

> Hugo Mills  writes:
>
>> It has to be disabled because if you enable it, there's a race
>> condition: since you're overwriting existing data (rather than CoWing
>> it), you can't update the checksums atomically. So, in the interests
>> of consistency, checksums are disabled.
>
> I suppose this has been suggested before, but couldn't it store both the
> new and the old checksums and be satisfied if either of them match?
Actually, I don't think that has been suggested before; read on, however,
for an explanation of why we don't do that.


> The user is probably not happy that a partial write is going to be
> difficult to read from the device due to a checksum error, but there is
> no promise of recently-overwritten data state with traditional
> filesystems either in case of sudden powerdown, assuming there is no
> data journaling..
And that is exactly the case with how things are now, when something is 
marked NOCOW, it has essentially zero guarantee of data consistency 
after a crash.  As things are now though, there is a guarantee that you 
can still read the file, but using checksums like you suggest would 
result in it being unreadable most of the time, because it's 
statistically unlikely that we wrote the _whole_ block (IOW, we can't 
guarantee without COW that the data was completely written) because:
a. While some disks do atomically write single sectors, most don't, and 
if the power dies during the disk writing a single sector, there is no 
certainty exactly what that sector will read back as.
b. Assuming that item a is not an issue, one block in BTRFS is usually 
multiple sectors on disk, and a majority of disks have volatile write 
caches, thus it is not unlikely that the power will die during the 
process of writing the block.
c. In the event that both items a and b are not an issue (for example, 
you have a storage controller with a non-volatile write cache, have 
write caching turned off on the disks, and it's a smart enough storage 
controller that it only removes writes from the cache after they 
return), then there is still the small but distinct possibility that the 
crash will cause either corruption in the write cache, or some other 
hardware related issue.







Re: btrfs autodefrag?

2015-10-19 Thread Erkki Seppala
Austin S Hemmelgarn  writes:

> And that is exactly the case with how things are now, when something
> is marked NOCOW, it has essentially zero guarantee of data consistency
> after a crash.

Yes. In addition to the zero guarantee of the data validity for the data
being written into, btrfs also doesn't give any guarantees for the rest
of the data, even if it was perfectly quiescent, but was just marked COW
at the time it was written :).

>  As things are now though, there is a guarantee that
> you can still read the file, but using checksums like you suggest
> would result in it being unreadable most of the time, because it's
> statistically unlikely that we wrote the _whole_ block (IOW, we can't
> guarantee without COW that the data was completely written) because:

Well, the amount of data being written at any given time is very small
compared to the whole device. So it's not all the data that is at risk
of having the wrong checksum. Given how small blocks are (4k) I really
doubt that the likelihood of large amounts of data remaining unreadable
would be great.

However, here's a compromise: when detecting an error on a COW file,
instead of refusing to read it, produce a warning to the kernel log. In
addition, when scrubbing it, as a last resort after trying the other
copies, the checksum could simply be repaired, paired with an appropriate
log message. Such a log message would not indicate that the data is
wrong, but that the system administrator might be interested in checking
it, for example against backups, or perhaps by running a scrub within
the virtual machine.

If the scrub would say everything is OK, then certainly everything would
be OK.

> a. While some disks do atomically write single sectors, most don't,
> and if the power dies during the disk writing a single sector, there
> is no certainty exactly what that sector will read back as.

So it seems that the majority vote is not to provide a feature to the
minority.. :)

> b. Assuming that item a is not an issue, one block in BTRFS is usually
> multiple sectors on disk, and a majority of disks have volatile write
> caches, thus it is not unlikely that the power will die during the
> process of writing the block.

I'm not at all familiar with the on-disk structure of Btrfs, but it
seems that indeed the block size is 16 kilobytes by default, so the risk
of one of the four device-blocks (on modern 4kB-sector HDDs) being
corrupted or only a set of them having been written is real. But,
there's only so much data in-flight at any given time.

I did read that there are two checksums (on Wikipedia,
Btrfs#Checksum_tree..): one per block, and one per contiguous run of
allocated blocks. The latter checksum seems more likely to be broken,
but I don't see why, in that case, the per-block checksums (or one of
the two checksums I proposed) couldn't be referred to instead. This is
of course because I don't understand much of the Btrfs on-disk format,
technical feasibility be damned :).

I understand that the metadata is always COW, so that level of
corruption cannot occur.

> c. In the event that both items a and b are not an issue (for example,
> you have a storage controller with a non-volatile write cache, have
> write caching turned off on the disks, and it's a smart enough storage
> controller that it only removes writes from the cache after they
> return), then there is still the small but distinct possibility that
> the crash will cause either corruption in the write cache, or some
> other hardware related issue.

However, should this not be the case, for example when my computer is
never brought down abruptly, it could still be valuable information to
see that the data has not changed behind my back.

I understand it is the prime motivation behind btrfs scrubbing in any
case; otherwise there could be a faster 'queue a verify after a write'
that would never scrub the same data twice.
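
(For clarity, by scrubbing I just mean the ordinary mechanism, along
the lines of:

  btrfs scrub start /mnt/data    # re-verify all checksummed data and metadata
  btrfs scrub status /mnt/data   # progress and error counters

with /mnt/data of course only being a placeholder for the mount point.)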

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/



Re: btrfs autodefrag?

2015-10-19 Thread Austin S Hemmelgarn

On 2015-10-19 12:13, Erkki Seppala wrote:

> Austin S Hemmelgarn  writes:
>
>> And that is exactly the case with how things are now, when something
>> is marked NOCOW, it has essentially zero guarantee of data consistency
>> after a crash.
>
> Yes. In addition to the zero guarantee of the data validity for the data
> being written into, btrfs also doesn't give any guarantees for the rest
> of the data, even if it was perfectly quiescent, but was just marked COW
> at the time it was written :).
Assuming you do actually mean COW and not NOCOW, there is a guarantee
that the data will either:

1. Match the original data prior to the write.
2. Match the data that was written.

or, if you are using only a single copy of the metadata blocks and the
system crashes exactly during a write to a metadata block:

3. Everything under that metadata block will become inaccessible, and
will require btrfs-progs to recover.


In the case of NOCOW however, there is absolutely no such guarantee 
(just like ext4 for example can not provide such a guarantee), and any 
of the above could be the case, or any arbitrary portion of the new data 
could have been written.

>> As things are now though, there is a guarantee that
>> you can still read the file, but using checksums like you suggest
>> would result in it being unreadable most of the time, because it's
>> statistically unlikely that we wrote the _whole_ block (IOW, we can't
>> guarantee without COW that the data was completely written) because:
>
> Well, the amount of data being written at any given time is very small
> compared to the whole device. So it's not all the data that is at risk
> of having the wrong checksum. Given how small blocks are (4k) I really
> doubt that the likelihood of large amounts of data remaining unreadable
> would be great.
That very much depends on how you are using things.  For many of the
types of things which NOCOW should be used for, direct I/O and AIO are
also very commonly used, and those can write chunks much bigger than
BTRFS's block size in one go.
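
(As a trivial illustration of the sort of write I mean, something like
the following issues 1 MiB direct writes, far larger than a single
BTRFS block -- the target file name is just an example:

  dd if=/dev/zero of=/mnt/data/testfile bs=1M count=64 oflag=direct

and databases or VM hypervisors do the equivalent through their own
direct I/O paths.)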


> However, here's a compromise: when detecting an error on a COW file,
> instead of refusing to read it, produce a warning to the kernel log. In
> addition, when scrubbing it, as a last resort after trying the other
> copies, the checksum could simply be repaired, paired with an appropriate
> log message. Such a log message would not indicate that the data is
> wrong, but that the system administrator might be interested in checking
> it, for example against backups, or perhaps by running a scrub within
> the virtual machine.
In this case I'm assuming you mean NOCOW instead of COW, as the 
corruption can't be detected in a NOCOW file by BTRFS.


In a significant majority of cases, it is actually better to return no 
data than to return known corrupted data (think medical or military 
applications; in those kinds of cases it's quite often worse to act on 
incorrect data than it is to not act at all).  Disk images for virtual 
machines are one of the rare cases where this is not true, simply 
because they can usually correct the corruption themselves.


> If the scrub would say everything is OK, then certainly everything would
> be OK.
That's a _very_ optimistic point of view to take, and doesn't take into 
account software bugs, or potential hardware problems.



>> a. While some disks do atomically write single sectors, most don't,
>> and if the power dies during the disk writing a single sector, there
>> is no certainty exactly what that sector will read back as.
>
> So it seems that the majority vote is not to provide a feature to the
> minority.. :)
For something that provides a false sense of data safety and is 
potentially easy to shoot yourself in the foot with?  Yes we will almost 
certainly not provide it.  If, however, you wish to write a patch to 
provide such a feature (or pay someone to do so for you), there is 
nothing stopping you from doing so, and if it's something that people 
actually want, then it will likely end up included.

>> b. Assuming that item a is not an issue, one block in BTRFS is usually
>> multiple sectors on disk, and a majority of disks have volatile write
>> caches, thus it is not unlikely that the power will die during the
>> process of writing the block.
>
> I'm not at all familiar with the on-disk structure of Btrfs, but it
> seems that indeed the block size is 16 kilobytes by default, so the risk
> of one of the four device-blocks (on modern 4kB-sector HDDs) being
> corrupted or only a set of them having been written is real. But,
> there's only so much data in-flight at any given time.
While the default is usually 16k, there are situations where it may be 
different, for example if the system has a page size greater than 16k 
(some ARM64, PPC, and MIPS systems use 64k pages), or if it's a small 
filesystem (in which case the blocks will be 4k).
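
(If you want to check what a particular filesystem is actually using,
the node size is recorded in the superblock; depending on your
btrfs-progs version, something like one of these should show it:

  btrfs-show-super /dev/sdX | grep -i nodesize
  btrfs inspect-internal dump-super /dev/sdX | grep -i nodesize

with /dev/sdX being whichever device the filesystem lives on.)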


It is also worth noting that while most 'modern' HDDs use 4k sectors:
1. They are still vastly outnumbered by older HDDs that use 512 byte 
sectors.
2. A 

Re: btrfs autodefrag?

2015-10-18 Thread Xavier Gnata



On 18/10/2015 07:46, Duncan wrote:

[snip -- full quote of Duncan's reply; see his original message of
2015-10-17 further down in this thread]

Re: btrfs autodefrag?

2015-10-18 Thread Hugo Mills
On Sun, Oct 18, 2015 at 10:24:39AM -0400, Rich Freeman wrote:
> On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata  wrote:
> > 2) Disabling copy-on-write for just the VM image directory.
> 
> Unless this has changed, doing this will also disable checksumming.  I
> don't see any reason why it has to, but it does.  So, I avoid using
> this at all costs.

   It has to be disabled because if you enable it, there's a race
condition: since you're overwriting existing data (rather than CoWing
it), you can't update the checksums atomically. So, in the interests
of consistency, checksums are disabled.
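
   (For reference, the usual way people set this is something like:

  chattr +C /path/to/vm-images     # on a new, still-empty directory

so that files created in there afterwards inherit the NOCOW attribute;
setting +C on a file that already contains data isn't reliably
supported. And yes, with that attribute set, those files get no
checksums.)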

   Hugo.

-- 
Hugo Mills | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  dark




Re: btrfs autodefrag?

2015-10-18 Thread Rich Freeman
On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata  wrote:
> 2) Disabling copy-on-write for just the VM image directory.

Unless this has changed, doing this will also disable checksumming.  I
don't see any reason why it has to, but it does.  So, I avoid using
this at all costs.

--
Rich


Re: btrfs autodefrag?

2015-10-17 Thread Duncan
Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:

> Hi,
> 
> On a desktop equipped with an ssd with one 100GB virtual image used
> frequently, what do you recommend?
> 1) nothing special, it is all fine as long as you have a recent kernel
> (which I do)
> 2) Disabling copy-on-write for just the VM image directory.
> 3) autodefrag as a mount option.
> 4) something else.
> 
> I don't think this usecase is well documented therefore I asked this
> question.

You are correct.  The VM images on ssd use-case /isn't/ particularly well 
documented, I'd guess because people have differing opinions, and, 
indeed, actual observed behavior, and thus recommendations even in the 
ideal case, may well be different depending on the specs and firmware of 
the ssd.  The documentation tends to be aimed at the spinning rust case.

There's one detail of the use-case (besides ssd specs), however, that you 
didn't mention, that could have a big impact on the recommendation.  What 
sort of btrfs snapshotting are you planning to do, and if you're doing 
snapshots, does your use-case really need them to include the VM image 
file?

Snapshots are a big issue for anything that you might set nocow, because 
snapshot functionality assumes and requires cow, and thus conflicts, to 
some extent, with nocow.  A snapshot locks in place the existing extents, 
so they can no longer be modified.  On a normal btrfs cow-based file, 
that's not an issue, since any modifications would be cowed elsewhere 
anyway -- that's how btrfs normally works.  On a nocow file, however, 
there's a problem, because once the snapshot locks in place the existing 
version, the first change to a specific block (normally 4 KiB) *MUST* be 
cowed, despite the nocow attribute, because to rewrite in-place would 
alter the snapshot.  The nocow attribute remains in place, however, and 
further writes to the same block will again be nocow... to the new block 
location established by that first post-snapshot write... until the next 
snapshot comes along and locks that too in-place, of course.  This sort 
of cow-only-once behavior is sometimes called cow1.

If you only do very occasional snapshots, probably manually, this cow1 
behavior isn't /so/ bad, tho the file will still fragment over time as 
more and more bits of it are written and rewritten after the few 
snapshots that are taken.  However, for people doing frequent, generally 
schedule-automated snapshots, the nocow attribute is effectively 
nullified as all those snapshots force cow1s over and over again.

So ssd or spinning rust, there's serious conflicts between nocow and 
snapshotting that really must be taken into consideration if you're 
planning to both snapshot and nocow.

For use-cases that don't require snapshotting of the nocow files, the 
simplest workaround is to put any nocow files on dedicated subvolumes.  
Since snapshots stop at subvolume boundaries, having nocow files on 
dedicated subvolume(s) stops snapshots of the parent from including them, 
thus avoiding the cow1 situation entirely.
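
A minimal sketch of that arrangement (the mount point and names are 
just examples):

  # dedicated subvolume that snapshots of the parent won't descend into
  btrfs subvolume create /mnt/pool/vm-images
  # mark the still-empty directory nocow so new files inherit the attribute
  chattr +C /mnt/pool/vm-images
  # then create or copy the VM images in, so they're nocow from the start

Snapshots of the parent subvolume then simply never include the images.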

If the use-case requires snapshotting of nocow files, the workaround that 
has been reported (mostly on spinning rust, where fragmentation is a far 
worse problem due to non-zero seek-times) to work is first to reduce 
snapshotting to a minimum -- if it was going to be hourly, consider daily 
or every 12 hours, if you can get away with it, if it was going to be 
daily, consider every other day or weekly.  Less snapshotting means less 
cow1s and thus directly affects how quickly fragmentation becomes a 
problem.  Again, dedicated subvolumes can help here, allowing you to 
snapshot the nocow files on a different schedule than you do the up-
hierarchy parent subvolume.  Second, schedule periodic manual defrags of 
the nocow files, so the fragmentation that does occur is at least kept 
manageable.  If the snapshotting is daily, consider weekly or monthly 
defrags.  If it's weekly, consider monthly or quarterly defrags.  Again, 
various people who do need to snapshot their nocow files have reported 
that this really does help, keeping fragmentation to at least some sanely 
managed level.
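
The defrag itself can be as simple as a scheduled run of something 
like (path again just an example):

  btrfs filesystem defragment -r /mnt/pool/vm-images

Keep in mind, tho, that with snapshot-aware defrag currently disabled 
in the kernel, defragmenting a file that's shared with snapshots 
unshares those extents, so expect some extra space usage after each run.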

That's the snapshot vs. nocow problem in general.  With luck, however, 
you can avoid snapshotting the files in question entirely, thus factoring 
this issue out of the equation entirely.

Now to the ssd issue.

On ssds in general, there are two very major differences we need to 
consider vs. spinning rust.  One, fragmentation isn't as much of a 
problem as it is on spinning rust.  It's still worth keeping to a 
minimum, because as the number of fragments increases, so does both btrfs 
and device overhead, but it's not the nearly everything-overriding 
consideration that it is on spinning rust.

Two, ssds have a limited write-cycle factor to consider, where with 
spinning rust the write-cycle limit is effectively infinite... at least 
compared to the much lower limit of ssds.

The weighing of these two overriding