On 2017-08-16 21:12, Chris Mason wrote:
On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote:
On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:
Quite a few applications actually _do_ have some degree of secondary
verification or protection from a crash.  Go look at almost any database
software.
Then please give proper references for this!

This is from 2015, when you already claimed this. I looked up all
the bigger DBs, and they either couldn't do it at all, didn't do it by
default, or required application support (i.e. from the programs
using the DB):
https://www.spinics.net/lists/linux-btrfs/msg50258.html


It usually will not have checksumming, but it will almost
always have support for a journal, which is enough to cover the
particular data loss scenario we're talking about (unexpected unclean
shutdown).

I don't think that is what we're talking about:
We're talking about people wanting checksumming to notice e.g. silent data
corruption.

The crash case is only the corner case of what happens if the data
is written correctly but the csums are not.

We use the crcs to catch storage gone wrong, both in terms of simple things like cabling, bus errors, drives gone crazy or exotic problems like every time I reboot the box a handful of sectors return EFI partition table headers instead of the data I wrote. You don't need data center scale for this to happen, but it does help...

So, we do catch crc errors in prod and they do keep us from replicating bad data over good data. Some databases also crc, and all drives have correction bits of some kind. There's nothing wrong with crcs happening at lots of layers.

Btrfs couples the crcs with COW because it's the least complicated way to protect against:

* bits flipping
* IO getting lost on the way to the drive, leaving stale but valid data in place
* IO from sector A going to sector B instead, overwriting valid data with other valid data.

It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps.
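To make the three failure cases above concrete, here is a minimal toy sketch. It is not btrfs code and not its on-disk format: `ToyDisk`, `ToyCowStore` and `run_scenarios` are hypothetical names made up for illustration. The only idea it borrows is that every write goes to a freshly allocated block (COW) and the crc is recorded in an index kept apart from the data, so stale or misplaced data cannot carry its own matching crc.

```python
import zlib


class ToyDisk:
    """Hypothetical block device: block number -> bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data

    def read(self, block):
        # Unwritten blocks read back as zeros.
        return self.blocks.get(block, b"\x00" * 8)


class ToyCowStore:
    """Keeps (block, crc32) per key in an index separate from the data."""

    def __init__(self, disk):
        self.disk = disk
        self.index = {}      # key -> (block number, crc32 of the data)
        self.next_free = 0

    def write(self, key, data, device_write=None):
        # COW: always allocate a new block, never overwrite in place.
        block = self.next_free
        self.next_free += 1
        # device_write lets the caller simulate a misbehaving device.
        (device_write or self.disk.write)(block, data)
        self.index[key] = (block, zlib.crc32(data))

    def read(self, key):
        block, expected = self.index[key]
        data = self.disk.read(block)
        if zlib.crc32(data) != expected:
            raise IOError(f"csum mismatch on block {block}")
        return data


def detected(store, key):
    """True if reading key fails the crc check."""
    try:
        store.read(key)
    except IOError:
        return True
    return False


def run_scenarios():
    results = {}

    # 1. Bit flip: the block changes after a good write.
    disk = ToyDisk()
    store = ToyCowStore(disk)
    store.write("a", b"AAAAAAAA")
    disk.blocks[0] = b"AAAAAAAB"
    results["bit_flip"] = detected(store, "a")

    # 2. Lost write: the device drops the IO, stale data stays in place.
    disk = ToyDisk()
    store = ToyCowStore(disk)
    disk.blocks[0] = b"stale..."
    store.write("a", b"AAAAAAAA", device_write=lambda block, data: None)
    results["lost_write"] = detected(store, "a")

    # 3. Misdirected write: IO meant for block 1 lands on block 0.
    disk = ToyDisk()
    store = ToyCowStore(disk)
    store.write("a", b"AAAAAAAA")
    store.write("b", b"BBBBBBBB",
                device_write=lambda block, data: disk.write(0, data))
    results["misdirected_victim"] = detected(store, "a")
    results["misdirected_target"] = detected(store, "b")
    return results


if __name__ == "__main__":
    print(run_scenarios())
```

In this toy model all three cases end in a clearly defined read error rather than silently returned wrong data.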

In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data.

Btrfs csum is really good, especially for cases like RAID1/5/6, where the csum can provide extra info about which mirror/stripe/parity can be trusted, with minimal space wasted.
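As a rough illustration of the RAID1 case, the sketch below picks whichever mirror copy matches the recorded csum and rewrites the copies that don't. It is a toy model under stated assumptions, not btrfs's repair path: `read_mirrored` is a hypothetical helper and each mirror is just a dict of block number -> bytes.

```python
import zlib


def read_mirrored(mirrors, block, expected_crc):
    """Return (data, repaired_indexes) using the csum to choose a good copy.

    mirrors is a list of dicts (block number -> bytes), one per mirror.
    """
    good = None
    bad = []
    for i, mirror in enumerate(mirrors):
        data = mirror.get(block)
        if data is not None and zlib.crc32(data) == expected_crc:
            if good is None:
                good = data
        else:
            bad.append(i)
    if good is None:
        # No copy can be trusted: report a clear error instead of bad data.
        raise IOError("no mirror holds a copy matching the csum")
    for i in bad:
        # Replicate the good data over the bad copy.
        mirrors[i][block] = good
    return good, bad


if __name__ == "__main__":
    payload = b"payload"
    mirrors = [{7: b"garbage"}, {7: payload}]
    data, repaired = read_mirrored(mirrors, 7, zlib.crc32(payload))
    print(data, repaired)
```

Without the csum, a two-way mirror with differing copies has no way to tell which one is right; with it, the choice costs one crc per copy.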

The DM layer should really have the ability to verify its data at that point, like btrfs does.


My real goal is to make COW fast enough that we can leave it on for the database applications too.

Yes, most of the complexity of nodatasum/nodatacow comes from those special workloads.

BTW, when Fujitsu tested the postgresql workload on btrfs, the results were quite interesting.

For HDD, when the number of clients is low, btrfs shows an obvious performance drop. The problem seems to be the mandatory metadata COW, which leads to superblock FUA updates. When the number of clients grows, the difference between btrfs and other filesystems gets much smaller, because the bottleneck is the HDD itself.

For SSD, when the number of clients is low, btrfs performs almost the same as other filesystems, and nodatacow/nodatasum makes only a marginal difference.
But when the number of clients grows, btrfs falls far behind other filesystems.
The reason seems to be related to how postgresql commits its transactions: it always fsyncs its journal sequentially, without concurrency. Btrfs needs to wait for its data writes before updating its log tree, so most of its time is wasted waiting on data IO. In that case, nodatacow does improve performance, by allowing btrfs to update its log tree without waiting for data IO.
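The access pattern described here can be sketched as follows. This is not PostgreSQL's WAL code, only a minimal illustration of "append a record, fsync, then move on to the next one"; `commit_serially` is a hypothetical helper name.

```python
import os
import tempfile


def commit_serially(path, records):
    """Append each record and fsync before starting the next one.

    Returns the number of fsync calls issued.
    """
    fsyncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            os.write(fd, record)
            # The commit is not acknowledged until fsync returns, so the
            # next record cannot start while this one is still in flight.
            os.fsync(fd)
            fsyncs += 1
    finally:
        os.close(fd)
    return fsyncs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        journal = os.path.join(tmp, "journal")
        print(commit_serially(journal, [b"txn 1\n", b"txn 2\n", b"txn 3\n"]))
```

Because each fsync must complete before the next write begins, any extra wait inside fsync adds directly to the latency of every commit.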

But in both cases, CoW itself (allocating new extents, calculating csums) is not the main cause of the btrfs slowdown.
That is to say, nodatacow is not as important as we used to think.

If we can get rid of nodatacow/nodatasum, there will be much less for us developers to consider, and fewer related bugs.

Thanks,
Qu

Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html